A Cryptographic Library for the Motorola DSP56000

Stephen R. Dussé

Burton S. Kaliski Jr.

RSA Data Security Inc.

Redwood City, CA

Abstract. We describe a cryptographic library for the Motorola DSP56000 that provides hardware speed

yet software flexibility. The library includes modular arithmetic, DES, message digest and other methods.

Of particular interest is an algorithm for modular multiplication that interleaves multiplication with

Montgomery modular reduction to give a very fast implementation of RSA.

Key words. Data Encryption Standard (DES), Encryption hardware, Message digest, Modular arithmetic,

Montgomery reduction, Motorola DSP56000, Multiple-precision arithmetic, RSA.

1. Introduction

As cryptography becomes more widespread, fast yet flexible cryptographic tools are

becoming important. Experience with hardware tools has shown that speed often cannot

fully be realized unless all cryptographic methods of interest are implemented in

hardware. For example, digital signatures are often implemented with a message digest

followed by a public key encryption (as suggested first by [8]), so speeding up only the

public key encryption may not be sufficient. Nevertheless, hardware implementations of

many important yet nonstandard methods are hard to find.

We therefore propose that the right tool for many applications is not custom hardware but

a fast general-purpose processor.

We have recently developed a cryptographic library for one such processor, the Motorola

DSP56000 digital signal processor. The library includes the following methods:

Multiple-precision arithmetic. Several cryptosystems [12][16][18][19][26]

involve integers hundreds of digits long, so this is a necessity.

Data Encryption Standard [7]. Though its security has been questioned [2], it

remains an important tool.

Message digest. This operation is essential to almost every signature scheme.

Flexibility is important as there is no widely accepted, secure, standard message

digest; our choices include FIPS 113 MAC [6] and RSA-MD2, both of which

were proposed for Internet electronic mail [22]. We are also considering RSA-MD4

[25].

I.B. Damgård (Ed.): Advances in Cryptology - EUROCRYPT '90, LNCS 473, pp. 230-244, 1991

© Springer-Verlag Berlin Heidelberg 1991

In evaluating various general-purpose processors we found that the DSP56000 is

especially well-suited because it can multiply two 24-bit integers and add the product to a

56-bit integer in 100 ns [14]. Such an operation is important not only in digital signal

processing but also in multiple-precision arithmetic. The 24-bit word size also matches

the 48-bit round keys of DES nicely. However, we expect that most of our results can be

applied on other general-purpose processors.

This paper is organized as follows. We begin by describing our algorithms for RSA and

DES. Then we present the design of a "crypto-accelerator card" for the IBM PC. Finally,

we summarize the performance of the cryptographic library.

2. Related work

Work that motivated ours is Barrett's, Wiener's and Davio et al.'s. Barrett observed the

effectiveness of digital signal processors for cryptography and presented an

implementation of RSA on Texas Instruments' TMS32010 [1]. Wiener developed a

general software implementation of RSA on the DSP56000 that achieves 10.2K bits/s for

512-bit modular exponentiation with the Chinese remainder theorem 141. An

implementation specific to 512-bit moduli is even faster [30]. Davio et al. made

considerable progress in efficient techniques for DES [9], some of which we apply in our

implementation.

Among other recent work on fast cryptography are Buell and Ward's implementation of

multiple-precision arithmetic on a Cray computer [5] and Laurichesse's fast

implementations of RSA on conventional processors [21]. A number of fast hardware

implementations can be found in Brickell's 1989 survey [4].

Currently the record for the fastest implementation of RSA is held by Shand, Bertin and

Vuillemin of Digital Equipment Corporation's Paris Research Laboratories, who have

achieved 226K bits/s for 508-bit modular exponentiation with the Chinese remainder

theorem [29].

3. RSA algorithm

We now describe our implementation of RSA on the Motorola DSP56000. This section

addresses the algorithms; performance is dealt with in Sec. 6.

In the RSA cryptosystem [26] one performs modular exponentiation: computations of the

form C = M^E mod N where C, M, E and N are multiple-precision integers. This

computation is central to several other cryptosystems [12][16][18][19] so our results

apply to those as well.

Speeding modular exponentiation has been of interest for some time, and there are a

number of speedups [3][20][24][27]. We focus on one particular aspect, the integration of

multiple-precision multiplication with modular reduction according to Montgomery's

method [23]. Our speedup is complementary to others that focus on reducing the number

of multiplications and reductions so ours and the others can be applied concurrently.
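As a point of reference for the discussion that follows, here is a minimal Python sketch of plain left-to-right binary (square-and-multiply) exponentiation. It is only one of the usual methods and is not claimed to be the one used in the library; the function name `mod_exp` is hypothetical. Each `%` stands for the multiply-then-reduce step that Montgomery's method replaces.

```python
def mod_exp(m: int, e: int, n: int) -> int:
    """Compute m**e mod n by scanning the exponent bits from the top down."""
    result = 1 % n
    for bit in bin(e)[2:]:                   # most significant bit first
        result = (result * result) % n       # one modular squaring per bit
        if bit == "1":
            result = (result * m) % n        # one modular multiplication per set bit
    return result


# Toy-sized check against Python's built-in modular exponentiation.
example = mod_exp(4, 13, 497)
```

The cost is one squaring per exponent bit plus one multiplication per set bit, which is why the speed of a single modular multiplication dominates the running time.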

Our algorithm is most effective on a processor on which multiplication is fast relative to

shifting, for then the convolution-sum approach described below outperforms the

conventional shift-and-add method. We believe our algorithm will result in some speed

improvement on every processor, but given that it is a little more complicated than

conventional methods, the algorithm may not be justified on all processors.

3.1 Montgomery's method

We now outline Montgomery's method for modular arithmetic. Readers familiar with the

topic may skip to Sec. 3.2.

In Montgomery's method we represent residue classes in an unusual way and redefine

modular arithmetic within this representation. Specifically, let N be an integer (the

modulus) and let R be an integer relatively prime to N. We represent the residue class A

mod N as AR mod N and redefine modular multiplication as

MONTGOMERY-PRODUCT(A, B, N, R) = A B R^-1 mod N

It is not hard to verify that Montgomery multiplication in the new representation is

isomorphic to modular multiplication in the ordinary one:

MONTGOMERY-PRODUCT(AR mod N, BR mod N, N, R) = (AB)R mod N

We can similarly redefine modular exponentiation as repeated Montgomery

multiplication. This "Montgomery exponentiation" can be computed with all the usual

modular exponentiation speedups. To compute ordinary modular exponentiation C = M^E

mod N, we compute M' = MR mod N (ordinary modular reduction), C' = (M')^E R^(1-E) mod

N (Montgomery exponentiation), and C = C' R^-1 mod N (Montgomery reduction).
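The three steps just described can be traced on toy numbers. The following Python sketch is illustrative only: the modulus 3233, R = 4096 and the function names are assumptions chosen for the example, and the Montgomery product is computed from its definition with Python's built-in modular inverse (Python 3.8 or later) rather than by any fast method.

```python
def montgomery_product(a: int, b: int, n: int, r: int) -> int:
    """Reference definition: a * b * r^-1 mod n."""
    return (a * b * pow(r, -1, n)) % n


def montgomery_exp(m_bar: int, e: int, n: int, r: int) -> int:
    """Square-and-multiply using only Montgomery products (assumes e >= 1)."""
    result = m_bar
    for bit in bin(e)[3:]:                   # bits after the leading 1
        result = montgomery_product(result, result, n, r)
        if bit == "1":
            result = montgomery_product(result, m_bar, n, r)
    return result


n, r = 3233, 4096            # toy modulus 61 * 53 and R = 2^12, relatively prime
m, e = 65, 17

m_bar = (m * r) % n                          # M' = MR mod N
c_bar = montgomery_exp(m_bar, e, n, r)       # C' = M^E R mod N
c = montgomery_product(c_bar, 1, n, r)       # C = C' R^-1 mod N
```

Multiplying by 1 in the last step is one way to express the final Montgomery reduction; the result `c` agrees with ordinary modular exponentiation.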

The practicality of Montgomery's method rests on the following nice theorem, which

leads directly to an algorithm for Montgomery multiplication.

Theorem 1 (Montgomery, 1985)

Let N and R be relatively prime integers, and let N' = -N^-1 mod R. Then for all integers

T, (T + MN)/R is an integer satisfying

(T + MN)/R ≡ T R^-1 (mod N)    (1)

where M = TN' mod R.

Proof. Equation 1 is straightforward. The fact that (T + MN)/R is an integer can be shown

by substituting M.

If we choose the right R (say, a power of the base in which we represent multiple-precision

integers) then division by R and reduction modulo R are trivial. With such an

R Montgomery reduction is no more expensive than two multiple-precision products, and

we can make it even easier.

3.2 Computing the Montgomery product

We now describe our algorithm for the Montgomery product. For the discussion we will

let b be the base in which multiple-precision integers are represented. That is, we will

represent an integer A as a sequence of digits (a_0, ..., a_{n-1}) where

A = sum_{i=0}^{n-1} a_i b^i    (2)

We will further require that all inputs to our algorithms can be represented in n base b

digits, and that R = b^n. In Sec. 3.3 we determine limitations on the individual digits a_0, ..., a_{n-1}.

We derive our algorithm by successive improvements, beginning with the following

algorithm taken directly from Theorem 1. (We note that our algorithm does not

"normalize" its output to the range [0, N-1]. Sec. 3.3 shows why.)

MONTGOMERY-PRODUCT(A, B, N, R)
1  N' ← -N^-1 mod R
2  T ← AB
3  M ← TN' mod R
4  T ← T + MN
5  return T/R
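The five steps above translate almost line for line into Python. This is a sketch for checking the arithmetic on small numbers, not the DSP56000 implementation; the function name and the toy values are assumptions, and `pow(n, -1, r)` (Python 3.8 or later) stands in for whatever inverse routine an implementation would use.

```python
def montgomery_product_direct(a: int, b: int, n: int, r: int) -> int:
    """Montgomery product computed directly from Theorem 1.

    The result is congruent to a * b * r^-1 mod n but is not
    normalized to the range [0, n-1].
    """
    n_prime = (-pow(n, -1, r)) % r     # step 1: N' = -N^-1 mod R
    t = a * b                          # step 2
    m = (t * n_prime) % r              # step 3
    t = t + m * n                      # step 4: T is now a multiple of R
    assert t % r == 0
    return t // r                      # step 5: exact division


n, r = 3233, 4096                      # toy values with gcd(n, r) = 1
a, b = 1234, 2345
p = montgomery_product_direct(a, b, n, r)
```

When R is a power of the digit base, the `% r` and `// r` operations reduce to dropping or keeping digits, which is what makes the method practical.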

Improvement 1. Instead of computing all of M at once, let us compute one digit m_i at a

time, add to T, and repeat. The resulting T may not be the same as in the original

algorithm but the effect of adding multiples of N will be: namely, to make T a multiple of

R. This is essentially the approach Montgomery gives for multiple-precision integers. We

note that this change allows us to compute n_0' = -n_0^-1 mod b instead of N'.

MONTGOMERY-PRODUCT(A, B, N, R)
1  n_0' ← -n_0^-1 mod b
2  T ← AB
3  for i ← 0 to n-1
4    do m_i ← t_i n_0' mod b
5       T ← T + m_i N b^i
6  return T/R
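A small Python sketch of the digit-at-a-time version follows. For readability it uses unsigned base-b digits extracted from Python integers, which is an assumption of the example (the signed digits of Sec. 3.3 are not modelled); the function name and the toy base 16 are likewise hypothetical.

```python
def montgomery_product_digitwise(a: int, b: int, n: int,
                                 base: int, ndigits: int) -> int:
    """Form T = A*B, then clear one base-b digit of T per iteration."""
    n0_prime = (-pow(n % base, -1, base)) % base   # n_0' = -n_0^-1 mod b
    t = a * b
    for i in range(ndigits):
        t_i = (t // base**i) % base                # current digit t_i
        m_i = (t_i * n0_prime) % base
        t += m_i * n * base**i                     # digit i of T becomes zero
    return t // base**ndigits                      # R = b^n, division is exact


n, base, ndigits = 3233, 16, 3                     # R = 16^3 = 4096
a, b = 1234, 2345
p = montgomery_product_digitwise(a, b, n, base, ndigits)
```

Only the single-digit inverse `n0_prime` is needed here, in place of the full-length N' of the direct algorithm.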

Improvement 2. Now let us interleave multiplication and reduction. We note that

Montgomery reduction is intrinsically a right-to-left procedure. That is, m_i depends only

on t_i. So we can begin adding this multiple to T as soon as we know t_i. This results in the

following algorithm:

MONTGOMERY-PRODUCT(A, B, N, R)
1  n_0' ← -n_0^-1 mod b
2  T ← 0
3  for i ← 0 to n-1
4    do T ← T + a_i B b^i
5       m_i ← t_i n_0' mod b
6       T ← T + m_i N b^i
7  return T/R
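The interleaved version can be sketched in Python in the same style. As before, unsigned digits, the function name and the toy parameters are assumptions made for the example; the point is only that digit t_i is final as soon as a_i B b^i has been added, so the reduction step can follow immediately.

```python
def montgomery_product_interleaved(a: int, b: int, n: int,
                                   base: int, ndigits: int) -> int:
    """Alternate one multiplication step and one reduction step per digit of A."""
    n0_prime = (-pow(n % base, -1, base)) % base   # n_0' = -n_0^-1 mod b
    t = 0
    for i in range(ndigits):
        a_i = (a // base**i) % base
        t += a_i * b * base**i                     # multiplication step
        t_i = (t // base**i) % base                # digit i is now final
        m_i = (t_i * n0_prime) % base
        t += m_i * n * base**i                     # reduction step clears digit i
    return t // base**ndigits


n, base, ndigits = 3233, 16, 3                     # R = 16^3 = 4096
a, b = 1234, 2345
p = montgomery_product_interleaved(a, b, n, base, ndigits)
```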

Improvement 3. At this point we can begin to observe a potential difficulty for the

DSP56000. The operation T ← T + a_i B b^i, the basic shift-and-add operation, is likely to

break down into the following single-precision operations:

4.1  do x ← t_i
4.2     for j ← 0 to n-1
4.3       do x ← x + a_i b_j
4.4          t_{i+j} ← x mod b
4.5          x ← x / b    (right shift)
4.6     t_{i+n} ← x    (initial t_{i+n} = 0)

These operations involve not only n single-precision multiplications but also n right

shifts. On many processors the "high part" and the "low part" of accumulators are

separately addressable and the right shift can be accomplished with move instructions.

This is true also on the DSP56000, but such shifting takes longer than a multiplication on

the DSP56000, which motivates us to minimize the number of right shifts. Happily, there

is a good way to avoid right shifts, and that is with the convolution-sum method of

multiplication. In this method instead of performing operations like T ← T + a_i B b^i, we

perform operations like T ← T + (sum_{i=0}^{k} a_i b_{k-i}) b^k. These involve k+1 multiplications but

only one shift. The fact that Montgomery reduction is intrinsically right-to-left helps us

again, and leads to our final algorithm.

MONTGOMERY-PRODUCT(A, B, N, R)
1  n_0' ← -n_0^-1 mod b

We expect that our final algorithm will generally be faster than the interleaved shift-and-add

version on most processors, because our algorithm has fewer right shifts, O(n) versus

O(n^2). It also has fewer stores, again O(n) versus O(n^2). (The number of other

operations, namely fetches, multiplies, and adds, is essentially the same for both algorithms.)

However, we note that our algorithm has more complex loop control. We also note that

our algorithm accumulates intermediate results that are a factor of 2n larger in magnitude

than those in the shift-and-add algorithm, so we need a larger accumulator and addition

instructions that can handle the larger accumulator.
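To make the column-by-column idea concrete, here is a Python sketch of a convolution-sum Montgomery product. It is a reconstruction from the description in the text, not a transcription of the authors' listing; the function names, the unsigned digits and the loop organization are assumptions. The variable `acc` plays the role of the wide accumulator and is shifted once per column.

```python
def to_digits(x: int, base: int, ndigits: int) -> list:
    """Little-endian unsigned base-b digits of x."""
    return [(x // base**i) % base for i in range(ndigits)]


def montgomery_product_columns(a: int, b: int, n: int,
                               base: int, ndigits: int) -> int:
    A, B, N = (to_digits(v, base, ndigits) for v in (a, b, n))
    n0_prime = (-pow(N[0], -1, base)) % base       # n_0' = -n_0^-1 mod b
    m = [0] * ndigits
    out = [0] * ndigits
    acc = 0                                        # wide accumulator
    for i in range(ndigits):                       # low columns: find m_i
        for j in range(i + 1):
            acc += A[j] * B[i - j]
        for j in range(i):
            acc += m[j] * N[i - j]
        m[i] = ((acc % base) * n0_prime) % base
        acc += m[i] * N[0]                         # low digit of acc is now zero
        acc //= base                               # the only shift for this column
    for i in range(ndigits, 2 * ndigits):          # high columns: emit result digits
        for j in range(i - ndigits + 1, ndigits):
            acc += A[j] * B[i - j] + m[j] * N[i - j]
        out[i - ndigits] = acc % base              # one store per column
        acc //= base
    # Any remaining carry is the top digit of the unnormalized result.
    return sum(d * base**k for k, d in enumerate(out)) + acc * base**ndigits


n, base, ndigits = 3233, 16, 3                     # R = 16^3 = 4096
a, b = 1234, 2345
p = montgomery_product_columns(a, b, n, base, ndigits)
```

Each of the 2n columns costs one shift and at most one store, while the inner loops contain only multiply-accumulate operations, which matches the O(n) versus O(n^2) comparison above.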

On most processors we can implement the larger accumulator with multiple registers and

the additions involving it with add-with-carry instructions. The DSP56000 is especially

well suited since its accumulator is eight bits longer than the largest product its ALU can

produce. Thus even without multiple registers or add-with-carry instructions the

DSP56000 can handle the intermediate results for n up to 128.
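The bound of 128 can be checked with a rough count. The sketch below assumes a 48-bit product width (inferred from the 56-bit accumulator being eight bits longer than the largest product) and ignores the carry brought in from the previous column; the variable names are hypothetical.

```python
ACC_BITS = 56                          # accumulator width, from the text
PRODUCT_BITS = ACC_BITS - 8            # accumulator is eight bits longer than a product

guard_bits = ACC_BITS - PRODUCT_BITS   # extra headroom above one product
max_terms = 2 ** guard_bits            # full-width products that can be summed
max_n = max_terms // 2                 # each column sums up to 2n products
```

With eight guard bits, 256 products can be accumulated, and since a column holds up to n products from A B and n from M N, this gives n up to 128.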

The extent to which our algorithm is faster depends mostly on the relative speeds of

multiplication and shifting. If multiplication is relatively slow then changes in the

number of shifts will have an insignificant effect on total execution time. For example, on

the Intel 80386 multiplication is an order of magnitude slower than shifting and we have

observed what appears to be at best a 10 percent improvement in execution time. But on

the DSP56000 the improvement is manyfold.

We conclude with a couple of remarks. First, we can derive a Montgomery squaring

algorithm MONTGOMERY-SQUARE(A, N, R) in the usual way that is asymptotically 25

percent faster than the alternative MONTGOMERY-PRODUCT(A, A, N, R).

Second, we can precompute n_0' = -n_0^-1 mod b once during a Montgomery

exponentiation since it depends only on the modulus N. Computing n_0' by a general

modular inverse algorithm such as extended Euclidean GCD would not be all that

expensive, since b is small. We have found instead (or rediscovered?) a very nice way to

compute the modular inverse in the special case that n_0 is odd and b is a power of 2:

MODULAR-INVERSE(x, b)
1  (Computes x^-1 mod b for x odd and b a power of 2.)
2  y_1 ← 1
3  for i ← 2 to lg b
4    do if x y_{i-1} < 2^{i-1} mod 2^i
5         then y_i ← y_{i-1}
6         else y_i ← y_{i-1} + 2^{i-1}
7  return y_{lg b}

The correctness of MODULAR-INVERSE can be verified by induction with the hypothesis

x y_i ≡ 1 mod 2^i.
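The inverse routine is short enough to try directly. The Python sketch below follows the pseudocode one bit at a time; the function name and the parameter `lg_b` (the base-2 logarithm of b) are naming assumptions. Negating the result modulo b gives the constant n_0' used by the Montgomery product.

```python
def modular_inverse_pow2(x: int, lg_b: int) -> int:
    """Return x^-1 mod 2^lg_b for odd x, fixing one more bit per iteration."""
    if x % 2 == 0:
        raise ValueError("x must be odd")
    y = 1                                      # x*y = 1 (mod 2)
    for i in range(2, lg_b + 1):
        # Invariant: x*y = 1 (mod 2^(i-1)), so x*y mod 2^i is 1 or 1 + 2^(i-1).
        if (x * y) % 2**i >= 2**(i - 1):
            y += 2**(i - 1)                    # x odd, so this cancels the extra bit
    return y


b = 2**24
n0 = 0xABCDEF                                  # any odd digit
n0_prime = (-modular_inverse_pow2(n0, 24)) % b # n_0' = -n_0^-1 mod b
```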

3.3 Representation of multiple-precision integers

We have not yet defined "base b representation" for the DSP56000, so we do so now.

The DSP56000 has a signed multiply instruction that multiplies two 24-bit two's-complement

integers and adds their product to a 56-bit accumulator. Thus the logical

choices for "base b representation" are a sequence of 24-bit signed digits and a sequence

of 23-bit unsigned digits. A sequence of 24-bit unsigned digits is rather awkward with a

signed multiply instruction. Given that the 23-bit unsigned representation of an integer

would generally be longer than the 24-bit signed representation, we chose the signed

representation.

Thus the digits a_i in Eq. 2 satisfy -2^23 <= a_i <= 2^23 - 1.

We now prove our claim that MONTGOMERY-PRODUCT need not adjust its result to the

range [0, N-1] by showing that the redundant range [-N, N-1] can be maintained through

all intermediate calculations.


