A Cryptographic Library for the Motorola DSP56000
Stephen R. Dussé
Burton S. Kaliski Jr.
RSA Data Security Inc.
Redwood City, CA
Abstract. We describe a cryptographic library for the Motorola DSP56000 that provides hardware speed
yet software flexibility. The library includes modular arithmetic, DES, message digest and other methods.
Of particular interest is an algorithm for modular multiplication that interleaves multiplication with
Montgomery modular reduction to give a very fast implementation of RSA.
Key words. Data Encryption Standard (DES), Encryption hardware, Message digest, Modular arithmetic,
Montgomery reduction, Motorola DSP56000, Multiple-precision arithmetic, RSA.
1. Introduction
As cryptography becomes more widespread, fast yet flexible cryptographic tools are
becoming important. Experience with hardware tools has shown that speed often cannot
fully be realized unless all cryptographic methods of interest are implemented in
hardware. For example, digital signatures are often implemented with a message digest
followed by a public key encryption (as suggested first by [8]), so speeding up only the
public key encryption may not be sufficient. Nevertheless, hardware implementations of
many important yet nonstandard methods are hard to find.
We therefore propose that the right tool for many applications is not custom hardware but
a fast general-purpose processor.
We have recently developed a cryptographic library for one such processor, the Motorola
DSP56000 digital signal processor. The library includes the following methods:
Multiple-precision arithmetic. Several cryptosystems [12][16][18][19][26]
involve integers hundreds of digits long, so this is a necessity.
Data Encryption Standard [7]. Though its security has been questioned [2], it
remains an important tool.
Message digest. This operation is essential to almost every signature scheme.
Flexibility is important as there is no widely accepted, secure, standard message
I.B. Damgård (Ed.): Advances in Cryptology - EUROCRYPT '90, LNCS 473, pp. 230-244, 1991
© Springer-Verlag Berlin Heidelberg 1991
digest; our choices include FIPS 113 MAC [6] and RSA-MD2, both of which
were proposed for Internet electronic mail [22]. We are also considering RSA-MD4
[25].
In evaluating various general-purpose processors we found that the DSP56000 is
especially well-suited because it can multiply two 24-bit integers and add the product to a
56-bit integer in 100 ns [14]. Such an operation is important not only in digital signal
processing but also in multiple-precision arithmetic. The 24-bit word size also matches
the 48-bit round keys of DES nicely. However, we expect that most of our results can be
applied on other general-purpose processors.
This paper is organized as follows. We begin by describing our algorithms for RSA and
DES. Then we present the design of a "crypto-accelerator card" for the IBM PC. Finally,
we summarize the performance of the cryptographic library.
2. Related work
Work that motivated ours is Barrett's, Wiener's and Davio et al.'s. Barrett observed the
effectiveness of digital signal processors for cryptography and presented an
implementation of RSA on Texas Instruments' TMS32010 [1]. Wiener developed a
general software implementation of RSA on the DSP56000 that achieves 10.2K bits/s for
512-bit modular exponentiation with the Chinese remainder theorem [4]. An
implementation specific to 512-bit moduli is even faster [30]. Davio et al. made
considerable progress in efficient techniques for DES [9], some of which we apply in our
implementation.
Among other recent work on fast cryptography are Buell and Ward's implementation of
multiple-precision arithmetic on a Cray computer [5] and Laurichesse's fast
implementations of RSA on conventional processors [21]. A number of fast hardware
implementations can be found in Brickell's 1989 survey [4].
Currently the record for the fastest implementation of RSA is held by Shand, Bertin and
Vuillemin of Digital Equipment Corporation's Paris Research Laboratories, who have
achieved 226K bits/s for 508-bit modular exponentiation with the Chinese remainder
theorem [29].
3. RSA algorithm
We now describe our implementation of RSA on the Motorola DSP56000. This section
addresses the algorithms; performance is dealt with in Sec. 6.
In the RSA cryptosystem [26] one performs modular exponentiation: computations of the
form $C = M^E \bmod N$ where C, M, E and N are multiple-precision integers. This
computation is central to several other cryptosystems [12][16][18][19] so our results
apply to those as well.
Speeding up modular exponentiation has been of interest for some time, and there are a
number of speedups [3][20][24][27]. We focus on one particular aspect, the integration of
multiple-precision multiplication with modular reduction according to Montgomery's
method [23]. Our speedup is complementary to others that focus on reducing the number
of multiplications and reductions so ours and the others can be applied concurrently.
Our algorithm is most effective on a processor on which multiplication is fast relative to
shifting, for then the convolution-sum approach described below outperforms the
conventional shift-and-add method. We believe our algorithm will result in some speed
improvement on every processor, but given that it is a little more complicated than
conventional methods, the algorithm may not be justified on all processors.
3.1 Montgomery's method
We now outline Montgomery's method for modular arithmetic. Readers familiar with the
topic may skip to Sec. 3.2.
In Montgomery's method we represent residue classes in an unusual way and redefine
modular arithmetic within this representation. Specifically, let N be an integer (the
modulus) and let R be an integer relatively prime to N. We represent the residue class A
mod N as AR mod N and redefine modular multiplication as
$$\text{MONTGOMERY-PRODUCT}(A, B, N, R) = ABR^{-1} \bmod N.$$
It is not hard to verify that Montgomery multiplication in the new representation is
isomorphic to modular multiplication in the ordinary one:
$$\text{MONTGOMERY-PRODUCT}(AR \bmod N, BR \bmod N, N, R) = (AB)R \bmod N.$$
We can similarly redefine modular exponentiation as repeated Montgomery
multiplication. This "Montgomery exponentiation'' can be computed with all the usual
modular exponentiation speedups. To compute ordinary modular exponentiation $C = M^E \bmod N$,
we compute $M' = MR \bmod N$ (ordinary modular reduction), $C' = (M')^E R^{1-E} \bmod N$
(Montgomery exponentiation), and $C = C'R^{-1} \bmod N$ (Montgomery reduction).
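The three steps above can be sketched in Python as follows. This is a minimal illustration, not the library's code: the function names are hypothetical, the Montgomery product is computed by its definition using Python's built-in modular inverse (Python 3.8 or later), and plain left-to-right square-and-multiply stands in for whatever exponentiation speedup one prefers.

```python
def montgomery_product(a_bar, b_bar, n, r):
    """Reference definition: a_bar * b_bar * r^-1 mod n (gcd(r, n) must be 1)."""
    return (a_bar * b_bar * pow(r, -1, n)) % n


def montgomery_exp(m, e, n, r):
    """Compute m**e mod n by working in the Montgomery representation."""
    m_bar = (m * r) % n              # M' = MR mod N (ordinary reduction)
    c_bar = r % n                    # representation of 1
    for bit in bin(e)[2:]:           # left-to-right square-and-multiply
        c_bar = montgomery_product(c_bar, c_bar, n, r)
        if bit == "1":
            c_bar = montgomery_product(c_bar, m_bar, n, r)
    # c_bar now holds M^E * R mod N; one more product by 1 removes the R.
    return montgomery_product(c_bar, 1, n, r)
```

For example, `montgomery_exp(5, 117, 391, 512)` should agree with `pow(5, 117, 391)`.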
The practicality of Montgomery's method rests on the following nice theorem, which
leads directly to an algorithm for Montgomery multiplication.
Theorem 1 (Montgomery, 1985)
Let N and R be relatively prime integers, and let $N' = -N^{-1} \bmod R$. Then for all integers
T, $(T+MN)/R$ is an integer satisfying
$$(T+MN)/R \equiv TR^{-1} \pmod{N} \qquad (1)$$
where $M = TN' \bmod R$.
Proof. Equation 1 is straightforward. The fact that $(T+MN)/R$ is an integer can be shown
by substituting M: $T + MN \equiv T + TN'N \equiv T - T \equiv 0 \pmod{R}$.
If we choose the right R (say, a power of the base in which we represent multiple-precision
integers) then division by R and reduction modulo R are trivial. With such an
R Montgomery reduction is no more expensive than two multiple-precision products, and
we can make it even easier.
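To illustrate why a power-of-two R makes the reduction cheap, here is a small Python sketch of Theorem 1 with $R = 2^k$. The name `montgomery_reduce` and the parameter `k` are assumptions made for this example; reduction modulo R becomes a bit mask and division by R becomes a right shift.

```python
def montgomery_reduce(t, n, k):
    """Return (t + m*n) / 2**k, which is congruent to t * 2**(-k) mod n.

    Assumes n is odd so that n is invertible modulo R = 2**k.
    """
    r_mask = (1 << k) - 1
    n_prime = (-pow(n, -1, 1 << k)) & r_mask   # N' = -N^-1 mod R
    m = ((t & r_mask) * n_prime) & r_mask      # M = T N' mod R (mask, no division)
    u = t + m * n
    assert u & r_mask == 0                     # T + MN is a multiple of R
    return u >> k                              # division by R is a shift
```

The result is not necessarily reduced below n; multiplying it back by $2^k$ and reducing modulo n recovers t mod n.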
3.2 Computing the Montgomery product
We now describe our algorithm for the Montgomery product. For the discussion we will
let b be the base in which multiple-precision integers are represented. That is, we will
represent an integer A as a sequence of digits $(a_0, \ldots, a_{n-1})$ where
$$A = \sum_{i=0}^{n-1} a_i b^i. \qquad (2)$$
We will further require that all inputs to our algorithms can be represented in n base b
digits, and that $R = b^n$. In Sec. 3.3 we determine limitations on the individual digits $a_0, \ldots, a_{n-1}$.
We derive our algorithm by successive improvements, beginning with the following
algorithm taken directly from Theorem 1. (We note that our algorithm does not
"normalize" its output to the range [0, N-1]. Sec. 3.3 shows why.)
MONTGOMERY-PRODUCT(A, B, N, R)
1 N' ← -N^{-1} mod R
2 T ← AB
3 M ← TN' mod R
4 T ← T + MN
5 return T/R
Improvement 1. Instead of computing all of M at once, let us compute one digit mi at a
time, add to T, and repeat. The resulting T may not be the same as in the original
algorithm but the effect of adding multiples of N will be: namely, to make T a multiple of
R. This is essentially the approach Montgomery gives for multiple-precision integers. We
note that this change allows us to compute $n_0' = -n_0^{-1} \bmod b$ instead of N'.
MONTGOMERY-PRODUCT(A, B, N, R)
1 n_0' ← -n_0^{-1} mod b
2 T ← AB
3 for i ← 0 to n-1
4     do m_i ← t_i n_0' mod b
5        T ← T + m_i N b^i
6 return T/R
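A Python rendering of Improvement 1 may make the digit-at-a-time idea concrete. It is a sketch under simplifying assumptions: digits are unsigned with $b = 2^{24}$ (the paper's signed digits are introduced in Sec. 3.3), Python's arbitrary-precision integers hold T, and the function and parameter names are hypothetical.

```python
def montgomery_product_digitwise(a, b_int, n, num_digits, base=1 << 24):
    """Return (a*b_int + M*n) / base**num_digits, clearing one digit of T per step.

    Assumes n is odd and a, b_int, n fit in num_digits base-`base` digits.
    The result is congruent to a*b_int*R^-1 mod n but is not normalized.
    """
    n0_prime = (-pow(n % base, -1, base)) % base   # n0' = -n0^-1 mod b
    t = a * b_int
    for i in range(num_digits):
        t_i = (t // base**i) % base                # current digit i of T
        m_i = (t_i * n0_prime) % base
        t += m_i * n * base**i                     # makes digit i of T zero
    return t // base**num_digits                   # exact division by R
```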
Improvement 2. Now let us interleave multiplication and reduction. We note that
Montgomery reduction is intrinsically a right-to-left procedure. That is, $m_i$ depends only
on $t_i$. So we can begin adding this multiple to T as soon as we know $t_i$. This results in the
following algorithm:
MONTGOMERY-PRODUCT(A, B, N, R)
1 n_0' ← -n_0^{-1} mod b
2 T ← 0
3 for i ← 0 to n-1
4     do T ← T + a_i B b^i
5        m_i ← t_i n_0' mod b
6        T ← T + m_i N b^i
7 return T/R
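The interleaved version can be sketched in Python in the same spirit. As before, this assumes unsigned digits with $b = 2^{24}$ and uses hypothetical names; it is meant only to show that each $m_i$ can be chosen as soon as row i of the product has been added.

```python
def montgomery_product_interleaved(a, b_int, n, num_digits, base=1 << 24):
    """Interleave one row of the product a*b_int with one reduction step.

    Assumes n is odd and all inputs fit in num_digits base-`base` digits.
    """
    n0_prime = (-pow(n % base, -1, base)) % base   # n0' = -n0^-1 mod b
    t = 0
    for i in range(num_digits):
        a_i = (a // base**i) % base
        t += a_i * b_int * base**i                 # T <- T + a_i B b^i
        t_i = (t // base**i) % base                # digit i is now final
        m_i = (t_i * n0_prime) % base
        t += m_i * n * base**i                     # T <- T + m_i N b^i
    return t // base**num_digits
```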
Improvement 3. At this point we can begin to observe a potential difficulty for the
DSP56000. The operation $T \leftarrow T + a_i B b^i$ (the basic shift-and-add operation) is likely to
break down into the following single-precision operations:
4.1 do x ← t_i
4.2    for j ← 0 to n-1
4.3        do x ← x + a_i b_j
4.4           t_{i+j} ← x mod b
4.5           x ← x / b    ▷ right shift
4.6    t_{i+n} ← x         ▷ initial t_{i+n} = 0
These operations involve not only n single-precision multiplications but also n right
shifts. On many processors the "high part" and the "low part" of accumulators are
separately addressable and the right shift can be accomplished with move instructions.
This is true also on the DSP56000, but such shifting takes longer than a multiplication on
the DSP56000, which motivates us to minimize the number of right shifts. Happily, there
is a good way to avoid right shifts, and that is with the convolution-sum method of
multiplication. In this method instead of performing operations like $T \leftarrow T + a_i B b^i$, we
perform operations like $T \leftarrow T + \left(\sum_{i=0}^{k} a_i b_{k-i}\right) b^k$. These involve k+1 multiplications but
only one shift. The fact that Montgomery reduction is intrinsically right-to-left helps us
again, and leads to our final algorithm.
MONTGOMERY-PRODUCT(A, B, N, R)
1 n_0' ← -n_0^{-1} mod b
We expect that our final algorithm will generally be faster than the interleaved shift-and-add
version on most processors, because our algorithm has fewer right shifts, O(n) versus
$O(n^2)$. It also has fewer stores, again O(n) versus $O(n^2)$. (The number of other
operations, namely fetches, multiplies, and adds, is essentially the same for both algorithms.)
However, we note that our algorithm has more complex loop control. We also note that
our algorithm accumulates intermediate results that are a factor of 2n larger in magnitude
than those in the shift-and-add algorithm, so we need a larger accumulator and addition
instructions that can handle the larger accumulator.
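The following Python sketch shows one way a convolution-sum Montgomery product can be organized. It is not the authors' listing: it assumes unsigned digits with $b = 2^{24}$, lets a Python integer `acc` play the role of the wide accumulator, and uses hypothetical names throughout. Each column k is summed completely before a single shift by one digit.

```python
def to_digits(x, n, base):
    """Little-endian list of n unsigned base-`base` digits of x."""
    return [(x // base**i) % base for i in range(n)]


def montgomery_product_convolution(a, b_int, modulus, n, base=1 << 24):
    """Column-by-column Montgomery product; assumes modulus is odd."""
    a_d = to_digits(a, n, base)
    b_d = to_digits(b_int, n, base)
    n_d = to_digits(modulus, n, base)
    n0_prime = (-pow(n_d[0], -1, base)) % base
    m = [0] * n
    acc = 0
    # Low columns 0..n-1: each column fixes one digit m_k and becomes zero.
    for k in range(n):
        for i in range(k + 1):
            acc += a_d[i] * b_d[k - i]
        for i in range(k):
            acc += m[i] * n_d[k - i]
        m[k] = (acc * n0_prime) % base
        acc += m[k] * n_d[0]
        acc //= base                      # exact: one shift per column
    # High columns n..2n-2: these produce the digits of T/R.
    result = 0
    for k in range(n, 2 * n - 1):
        for i in range(k - n + 1, n):
            acc += a_d[i] * b_d[k - i] + m[i] * n_d[k - i]
        result += (acc % base) * base ** (k - n)
        acc //= base
    return result + acc * base ** (n - 1)  # remaining carry is the top digit
```

In this arrangement the shift count grows linearly with n, while the number of multiplications stays at roughly $2n^2$.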
On most processors we can implement the larger accumulator with multiple registers and
the additions involving it with add-with-carry instructions. The DSP56000 is especially
well suited since its accumulator is eight bits longer than the largest product its ALU can
produce. Thus even without multiple registers or add-with-carry instructions the
DSP56000 can handle the intermediate results for n up to 128.
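As a rough check of this bound (an informal reading, not a derivation given in the text), let $P_{\max}$ denote the largest magnitude a single digit product can take. The accumulator's eight extra bits let it hold a sum of about $2^8$ such products, and the convolution-sum method accumulates up to 2n of them, so

```latex
\[
2n \cdot P_{\max} \le 2^{8} \cdot P_{\max}
\quad\Longrightarrow\quad
n \le 2^{7} = 128 .
\]
```

With 24-bit digits, n = 128 corresponds to operands of 128 × 24 = 3072 bits.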
The extent to which our algorithm is faster depends mostly on the relative speeds of
multiplication and shifting. If multiplication is relatively slow then changes in the
number of shifts will have an insignificant effect on total execution time. For example, on
the Intel 80386 multiplication is an order of magnitude slower than shifting and we have
observed what appears to be at best a 10 percent improvement in execution time. But on
the DSP56000 the improvement is manyfold.
We conclude with a couple of remarks. First, we can derive a Montgomery squaring
algorithm MONTGOMERY-SQUARE(A, N, R) in the usual way that is asymptotically 25
percent faster than the alternative MONTGOMERY-PRODUCT(A, A, N, R).
Second, we can precompute $n_0' = -n_0^{-1} \bmod b$ once during a Montgomery
exponentiation since it depends only on the modulus N. Computing $n_0'$ by a general
modular inverse algorithm such as extended Euclidean GCD would not be all that
expensive, since b is small. We have found instead (or rediscovered?) a very nice way to
compute the modular inverse in the special case that $n_0$ is odd and b is a power of 2:
MODULAR-INVERSE(x, b)
1 ▷ Computes x^{-1} mod b for x odd and b a power of 2.
2 y_1 ← 1
3 for i ← 2 to lg b
4     do if x y_{i-1} mod 2^i < 2^{i-1}
5         then y_i ← y_{i-1}
6         else y_i ← y_{i-1} + 2^{i-1}
7 return y_{lg b}
The correctness of MODULAR-INVERSE can be verified by induction with the hypothesis
$x y_i \equiv 1 \bmod 2^i$.
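The bit-by-bit inverse translates almost directly into Python. This is a sketch with a hypothetical function name; `lg_b` is the base-2 logarithm of b, so `lg_b=24` corresponds to $b = 2^{24}$. The loop maintains the invariant $x y \equiv 1 \pmod{2^i}$ and sets bit i-1 of y only when it is needed.

```python
def modular_inverse_pow2(x, lg_b):
    """Return x^-1 mod 2**lg_b for odd x, building the inverse one bit at a time."""
    if x % 2 == 0:
        raise ValueError("x must be odd")
    y = 1                                    # x*y = 1 (mod 2) since x is odd
    for i in range(2, lg_b + 1):
        # x*y mod 2^i is either 1 or 1 + 2^(i-1); fix it up in the second case.
        if (x * y) % (1 << i) >= (1 << (i - 1)):
            y += 1 << (i - 1)
    return y
```

The value needed by the Montgomery product is then `(-modular_inverse_pow2(n0, 24)) % (1 << 24)`, where `n0` is the low digit of the modulus.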
3.3 Representation of multiple-precision integers
We have not yet defined "base b representation" for the DSP56000, so we do so now.
The DSP56000 has a signed multiply instruction that multiplies two 24-bit two's-complement
integers and adds their product to a 56-bit accumulator. Thus the logical
choices for "base b representation" are a sequence of 24-bit signed digits and a sequence
of 23-bit unsigned digits. A sequence of 24-bit unsigned digits is rather awkward with a
signed multiply instruction. Given that the 23-bit unsigned representation of an integer
would generally be longer than the 24-bit signed representation, we chose the signed
representation.
Thus the digits $a_i$ in Eq. 2 satisfy $-2^{23} \le a_i \le 2^{23}-1$.
We now prove our claim that MONTGOMERY-PRODUCT need not adjust its result to the
range [0, N-1] by showing that the redundant range [-N, N-1] can be maintained through
all intermediate calculations.