Fpga integer factor
2010 International Conference on Field Programmable Logic and Applications
HighPerformance Integer Factoring with Reconfigurable Devices
Ralf Zimmermann, Tim Güneysu and Christof Paar
Horst Görtz Institute for ITSecurity, RuhrUniversity Bochum, Germany Email: {zimmermann,gueneysu,cpaar}@crypto.rub.de
Abstract—We present a novel FPGAbased implementation of the Elliptic Curve Method (ECM) for the factorization of mediumsized composite integers. More precisely, we demon strate an ECM implementation capable to determine prime factors of up to 2,424 151bit integers per second using a single Xilinx Virtex4 SX35 FPGA. Using this implementation on a cluster like the COPACOBANA is beneficial for attacking cryptographic primitives like the wellknown RSA cryptosystem with advanced methods such as the Number Field Sieve (NFS).
To provide this vast number of integer factorizations per FPGA, we make use of the available DSP blocks on each Virtex 4 device to accelerate lowlevel arithmetic computations. This methodology allows the development of a timearea efficient design that runs 24 ECM cores in parallel, implementing both phase 1 and phase 2 of the ECM. Moreover, our design is fully scalable and supports composite integers in the range from 66 to 236 bits without any significant modifications to the hardware. Compared to the implementation by Gaj et al., who reported an ECM design for the same Virtex4 platform, our improved architecture provides an advanced costperformance ratio which is better by a factor of 37.
Index Terms—Factorization, elliptic curve method, reconfig urable hardware, COPACOBANA.
I. INTRODUCTION
In 1987, the Elliptic Curve Method (ECM) was introduced by H. W. Lenstra [1] as a new method for integer factorization, generalizing the concept of Pollard's p−1 and Williams' p+1 method [2], [3]. Although the ECM is known not to be the fastest method for factorization with respect to asymptotical time complexity, it is widely used to factor composite numbers up to 200 bits due to its very limited requirements on memory.
The most prominent application that relies on the hardness of the factorization problem is the RSA cryptosystem. An attacker on RSA has to find the factorization of a composite number n which consists of two large primes p,q. More precisely, the RSA security parameter n is larger than 1024 bits and hence out of reach of the ECM. Up to date, such large bit sizes are preferably attacked with the most powerful methods known so far, such as the Number Field Sieve (NFS). However, the complex NFS1 involves the search of relations in which many midsized numbers need to be tested if they are "smooth", i.e., composed only of small prime factors not
1The NFS comprises of four steps, the polynomial selection, relation finding, a linear algebra step and finally the square root step. The relation finding step is most timeconsuming, taking roughly 90% of the runtime. For more information on the NFS refer to [4].
9780769541792/10 $26.00 © 2010 IEEE DOI 10.1109/FPL.2010.26
larger than a fixed boundary B. In this context, ECM is an important tool to determine the smoothness of such integers (i.e., if they can be factored into small primes), in particular due to its moderate resource requirements.
The fastest ECM implementations for retrieving factors of composite integers are softwarebased; a stateoftheart system is the GMPECM software published by P. Zimmer mann et al. [5] and has been extended for use with GPUs by Bernstein et al. [6]. As a promising alternative, efficient hardware implementations of the ECM were first proposed in 2005: Šimka et al. [7] demonstrated the feasibility to imple ment the ECM in reconfigurable hardware by presenting a first proofofconcept implementation. Their results were improved by Gaj et al. [8], [9], who also showed a complete hardware implementation of ECM phase 2. However, the lowlevel arith metic in these implementations were only implemented using straightforward techniques within the configurable logic which yet leaves room for further improvements. To fill this gap, de Meulenaer et al. [10] proposed an unrolled Montgomery multiplier based on a twodimensional pipeline on Xilinx Virtex4 FPGAs to accelerate the field arithmetic. However, due to limitations in area and the long pipeline design, their design only efficiently supports the first phase of the ECM.
Contribution: In this work we propose a novel ECM archi tecture for Xilinx Virtex FPGAs making use of DSP blocks for the computationally intensive arithmetic. Our focus is to accelerate the underlying field arithmetic of the ECM on FPGAs without sacrificing the option to combine both phase 1 and 2 in a single core. Thus, we adopt some highlevel decisions like memorymanagement and the use of SIMD instructions from [8] which also supports both phases on the same hardware. To improve the field arithmetic, we place fundamental arithmetic functions like adders and multipliers in embedded DSP blocks of modern FPGAs. For factoring large amounts of numbers, we finally describe our factor ization setup based on a variant of COPACOBANA (Cost Optimized PArallel COde Breaker)  a cluster system based on FPGAs [11], [12].
Outline: We start with a short review on the mathematical background and the concept of the ECM. In Section III we first describe the cluster system COPACOBANA, which represents the target platform of our work, and then discuss the architecture of an ECM core and its corresponding arithmetic components. Finally, we present our factorization results in Section IV before we conclude with Section V.
83
II. MATHEMATICAL BACKGROUND
We start with a brief introduction of the p − 1 method for
factorization to motivate the concept of the ECM. Let k ∈ N and n be the composite to be factored. Furthermore, let pn with p ∈ P, a ∈ Z and n be coprime, i.e., gcd(a,n) = 1. Now take Fermat's little Theorem [13, Fact 2.127] with ap−1 ≡ 1 mod p. The extension by the kmultiple of (p − 1) leading to ak(p−1) ≡ 1 mod p holds as well since (ap−1)k ≡ 1k =1modp. Then, with e=k(p−1) we have
ak(p−1) =ae ≡1modp ⇒ ae−1≡0modp
⇒ p(ae−1)
Hence, if ae ̸=1modn we know
1 < gcd(ae −1,n) < n.
In this case, we found a nontrivial divisor of n. However, we are not able to compute e = k(p − 1) without knowledge of p. Thus, we assume that p − 1 decomposes solely into small prime factors pi less than a defined bound B1 (in this case, p−1 is called B1smooth). Now we choose e as product of all prime powers pri lower than B1 and hope that e is a multiple of p − 1:
∏ ⌊logp (B1)⌋
e= pi (1)
i
pi∈P;pi<B1
By computing d = gcd(ae − 1, n) we finally hope to find a divisor d of n. Now, we transfer this idea to elliptic curves E. Let us assume q to be a factor of n and E(Zq) is B1smooth so that e, according to the construction in Equation (1), is a multiple of q − 1. Note that point multiplication by E(Zq) (or multiples of E(Zq)) returns the point at infinity, e.g., Q = eP = O for an arbitrary point P and resulting point Q. Recall that the resulting point Q = O implies a prior impossible division by zQ with zQ ≡ 0 mod q. Note that we actually perform all point operations in E(Zn) since we do not know q. Hence, we compute Q = eP in E(Zn) and hope to yield a point Q with coordinate zQ ̸= 0 mod n but zQ ≡ 0modq. Then, the factor q of n is obtained by q = gcd(zQ,n).
From an algorithmic point of view, we can discover a factor q of n as follows: in the first phase of the ECM, we compute Q = eP, where e is a product of prime powers pi ≤ B1 with appropriately chosen smoothness bounds. The second phase increases the success probability by checking for each prime B1 < p ≤ B2 whether pQ reduces to the neutral element in E(Zq). This second phase increases the chance to find a factor of n as shown by [14]. Algorithm 1 summarizes all necessary steps of both phases of the ECM.
If we consider a single curve only, the properties of the method are closely related to those of Pollard's (p−1)method that can fail by returning a trivial divisor, such as 1 or n. The advantage of ECM is apparent with the possibility of choosing another curve (and thus group order) after each unsuccessful
Algorithm 1 The Elliptic Curve Method
Input: Compositen=f1·f2·...···fn. Output: Factor fi of n.
8
8
8
4
4
4
1: 2: 3:
4: 5: 6: 7: 8: 9:
10: 11: 12: 13: 14: 15: 16: 17: 18: 19:
Phase 1:
Choose arbitrary curve E(Zn) and random point P ∈ E(Zn) ̸= O. Choose smoothness bounds B1, B2 ∈ N.
Compute e = ∏ p logpi (B1) pi∈P;pi<B1 i
Compute Q = eP = (xQ; yQ; zQ). Compute d = gcd(zQ, n).
Phase 2:
Sets:=1. foreachprimepwithB1<p≤B2do
Compute pQ = (xpQ : ypQ : zpQ).
Compute d = d · zpQ. end for
Compute fi = gcd(d, n). if1<fi <nthen
A nontrivial factor d is found.
return fi else
Restart from Step 2 in Phase 1.
end if
⌊⌋
trial, increasing the probability of retrieving factors of n. If the final gcd of the product s and n satisfies 1 < gcd(s, n) < n, a factor is found. Note that the parameters B1 and B2 control the probability of finding a divisor q. More precisely, if the order of P factors into a product of prime powers (each ≤ B1) and at most one additional prime between B1 and B2, the prime factor q is discovered.
Note that in the last years, most designs for the ECM preferably use elliptic curves in Montgomery or Edwards form [15], [16] since it allows to compute points with simpli fied formulas, e.g., by only using the x and zcoordinates. In this work, we will use Montgomery curves since they allow highly regular computation patterns which are favorable for hardware circuits.
III. DESIGNING THE ECM SYSTEM
To factor a large amount of midsized numbers as required by the NFS, we chose COPACOBANA as the target plat form for our ECM implementation. COPACOBANA [11] is a parallel lowcost cluster system that can be populated with up to 128 FPGAs in a single housing. More precisely, we modified the COPACOBANA architecture for this work to host Virtex4 FPGA devices distributed among 16 plugin cards which are all connected to single backplane. Each Virtex4 SX device provides 192 DSP blocks which can be cascaded to support pipelined, arithmetic intensive computations as shown in Section IIIB. Note that latest Spartan6 FPGAs (which also come with a large number of DSP blocks) are another promising alternative due to their costperformance ratio as target device. However, they are currently not available in sufficiently large quantities.
A. Architecture of an ECM Core
We will now describe the layout of our implementation. As it adopts some of the design choices by Gaj et al., we will begin with a short summary of the basic system and then outline our improvements.
Instruction ROM
17
Workspace Memory
17
17
17
MMUL1 Input
17 1
MulMod Input
MMUL2 Input
71
7
17
MADDSUB
MMUL1
7
MMUL2
1
System
24
ECM Core
ECM Core
ECM Core
ECM Core
ECM Core
ECM Core
ECM Core
Core
ECM Core
ECM Core
ECM Core
100 MHz (SYSCLK)
ECM Core
ECM Core
200 MHz (ECCCLK)
ECM Core
Control Unit
ctrl
ctrl
Elliptic Curve Processor
Synch RAM
ECM Core
CtrlUnit
Synch RAM
Synch RAM
based ECP, as shown in Figure 2, for improved performance. The processor contains a large instruction ROM, controlling four dedicated memory blocks and three underlying arithmetic units. It implements the sequences for point addition, point doubling, one step of the Montgomery Ladder [15] as well as optimized instructions to process the second phase.
To implement the second phase of the ECM, we chose the standard continuation method with a different precomputation and memory layout to that used in [8]. To continue with the previous point result from phase 1 and additional primes, we first need to store these primes in a primetable. Furthermore, we use a timememory tradeoff (dependent on a parameter D = 210) to store differences between individual primes represented using an interval scanning technique. For this, we calculate a set of 24 fixed multiples of the resulting point Q0 as well as the boundary points DQ0 and MMIN(B1)DQ0 (MMIN(B1) ≥ 2). As the parameter D is fixed for a given implementation, we computed an addition chain2 which returns all points for the chosen B1. Using the suggested boundary B1 = 960, our addition chain needs to perform 30 point additions and doubles 7 times instead of 1 point doubling and 52 point additions as reported in [8].
The computation suggested by Gaj et al. for phase 2 uses two multiplication and one addition/subtraction unit to perform subsequently three modular multiplications and corresponding additions per round after an initial delay of one leading multi plication. We remark that the initial delay exists in every round (and thus MM AX − MM I N + 1 times in total). Additionally, in case if this is an odd number, the final computation requires special handling since only one point is used instead of two. Algorithm 2 shows an improved version of the standard continuation which provides remedy to both issues and only requires a single initial multiplication per run.
We work on the variables r, s and d and delay the computation of r − s to the next iteration. Thus, we require an initialization (line 4) using the first prime we found in the primetable. Afterwards, these variables are updated in each iteration. For round m, we find two distinct points aQ0 and bQ0 and process these subsequent values (lines 11 to 14). At the end of the iteration, we renew the variables d, r and s. As
2Note that the chain must ensure that for any point addition R+Q we have already calculated (and stored) the corresponding difference point R − Q.
Fig. 2: Layout of the Elliptic Curve Processor
Fig. 1: Layout of the ECM system and ECM core
Figure 1 shows the components of our design: It consists of a buffered system bus which is fed by a data request scheduler which manages the I/O communication and timing dependencies and a fixed number ECM cores. The COPA COBANA is connected to a host PC which takes care of the necessary pre and postprocessing. Depending on the size of the communication interface (not included in the figure), up to 24 ECM cores can be placed on a Virtex4 SX35 FPGA with the number of DSPs as the limiting factor.
Previous versions of COPACOBANA did not require a fast communication interface since applications performing an ex haustive key search rarely exchange data. However, parameters and results for ECM operations need to be transferred between the cluster and the hostPC. In particular, some operations like gcd computations and the generation of elliptic curves and corresponding parameters are costly in hardware and are performed on the host system.
In the lower part of Figure 1, we outline an ECM core in detail. It consists of a control unit, the Elliptic Curve Processor (ECP) and a memory buffer to synchronize between clock domains. Each core is divided into two clock domains: The control unit operates at a 100 MHz clock, controlling the ECP and setting up the necessary computation parameters, while the ECP itself runs at a 200 MHz clock performing the arithmetic intensive operations. In order to cross these domains (i.e., for receiving the ECM parameters or sending the results) a dualported memory buffer is used, each port operating at the respective clock speed.
As our FPGA has a total of 192 memory modules, each ECM core contains a local memory storing parameters and instructions for phase 1 and 2 (control unit) and each ECP contains a dedicated instruction ROM. This redundancy is beneficial to reduce the critical path length in order to enable the fast ECP clock domain.
Gaj et al. suggested an efficient way to implement both phases of the ECM using two modular multiplication and one modular addition/subtraction unit (cf. [8]). We adopt this proposal for an highlevel abstraction but implement a DSP
8
8
8
5
5
5
Data Request Scheduler
Algorithm 2 Computation of d in Phase 2
Input: Bit table primetable and points R = (Rx :: Rz) = MMINDQ0, Q = DQ0 and jQ0
Output: Output d of second phase
1 AÆA
1 4
8
8 8
DSP2: PCIN + b(i) * A(j)
X+
P 48
48
1: 2: 3: 4: 5: 6: 7: 8: 9:
10: 11: 12: 13: 14: 15: 16: 17: 18: 19: 20: 21: 22: 23: 24: 25: 26:
m ← MMIN , pt ← primetablem i ← findNextPosition(pt) A←jiQ0 r←AzRx,s←AxRz,d←1 while m ≤ MMAX do
i ← findNextPosition(pt) if i ̸= END then
A ← jiQ0
i ← findNextPosition(pt) if i ̸= END then
B ← jiQ0 t←AzRx,u←AxRz,d1 ←r−s d ̃ ← d · d 1 , r ← B z R x , d 2 ← t − u d ← d ̃ · d 2 , s ← B x R z
else
r←AzRx,s←AxRz,d1 ←r−s d ← d · d1
end if end if
if i = END then
m ← m + 1, pt ← primetablem R←R+Q
end if end while
d1 ←r−s d←d·d1
BÆB SÆ
8
8
DSP1: S(i, j) + q(i) * M(j)
X+
soon as we reach the end of primetablem, we have one of two cases: either we found only aQ0, or no point since the last iteration. We cannot process aQ0 with the first prime found in primetablem+1, because we need to update the point R for the next iteration. Therefore, we update the variables r, s and d using two multiplications (lines 16 and 17). The second case does not provide a problem and requires no special handling. In both cases we increase m and modify R according to the basic algorithm. When we reach m = MM AX , we process the values remaining in r, s and return d (lines 25 and 26).
B. Modular Arithmetic Units
In this section we discuss the implementation of modular arithmetic units for elliptic point computations relying on the fast arithmetic using DSP blocks.
1) Modular Multiplication Unit: The modular multiplica tion A · B mod M implemented in this work is a variant of the modular multiplication algorithm without trial division proposed by Montgomery [17]. Orup published several mod ifications for this method to simplify the quotient handling in [18]. Straightforward quotient handling is particularly impor tant as it allows consecutive arithmetic operations in the DSP blocks which require a rather fixed realignment of operands. In this context, Orup's improvements only demand for word wise (i.e., kbit) multiplications, additions and shifting which are natively supported by the DSP blocks of Virtex4 FPGAs. Hence, using Orup's variant Modular Multiplication with Quo tient Pipelining, all arithmetic operations can be performed by sequential instructions in DSP blocks without need of additional resources in the configurable logic.
MÆA1
QÆB1
Fig. 3: DSP configuration of the modular multiplication unit We will now shortly revise Orup's modification to the
Montgomery multiplication. It needs n + 2 + d steps for a fixed delay parameter d. We set the pipelining parameter d = 0 and simplify the algorithm as shown in Algorithm 3.
Algorithm 3 Modular Multiplication with Quotient Pipelining
and d = 0
Input: Modulus M > 2 with gcd(M,2) = 1, k,n ∈ N : 4M ̃ < 2kn. Mˇ := (M ̃ + 1) div 2k, M ̃ := (M′ mod 2k)M, with (−MM′) mod 2k = 1. Integers A,B in Montgomery Form with 2knR−1 ≡ 1 modMand0≤A≤2M ̃,0≤B≤2M ̃.bi referstotheithkbit block of B.
Output: Sn+2 ≡ ABR−1(mod M) and 0 ≤ Sn+2 < 2Mˇ S0 := 0
for i := 0 to n do
qi := Si mod 2k
Si+1 := Si div 2k + qi · Mˇ + biA;
end for
Sn+2 := 2kSn+1
The definition of Mˇ represents the relation between the number of blocks n and the size of the modulus M: M < 2k(n−1)−2. To use dedicated DSP cores, we use the static word size of k = 17. In turn, if we implement n = 10, we allocate 170 bits for each intermediate result to perform computations on a 151bit modulus. This overhead is required because the algorithm does not compute the final reduction to save time.
Our implementation uses a fixed pipeline of three DSP cores to sequentially calculate the result. Using a sequential approach that processes individual word segments provides several advantages for lowlevel optimizations, e.g., the effi cient use of counters to address input and output memory.
Figure 3 shows the configuration of these three DSP cores. The index i denotes the round while the index j represents the current 17bit block of the input. Note that an additional feedback path which connects the result block to the qi input is not included in the figure for the sake of clarity. Our multiplier needs a total of n(n+2)+6 clock cycles to compute and return the result.
2) Modular Addition/Subtraction Unit: The modular addi tion/subtraction A+B mod M is no critical operation in terms of clock cycles. However, the proposed formulae for elliptic
8
8
8
6
6
6
+
A 18 [35]
DSP3: Shift(P) + (A:B + C)
B 18 [34..17]
(shared) C [16..0] 48
[A:B]
48 P
(shared) C
curve computations require that two modular additions must not exceed the time of one modular multiplication. Again, we use the arithmetic power of DSP blocks for implementing the modular addition, significantly reducing the required amount of configurable logic resources.
results are based on ECM parameters B1 = 960, B2 =57,000, D = 210 and the fixed scalar e of 1,323 bit. These parameters were used in most of the related work and allow a fair comparison.
Our ECM system is designed to allow straightforward scalability and to factor composite integers with smaller or larger bit sizes. It supports bit length of L = 17(b−1)−2 with b = 5..15 without notably hardware changes. This corresponds to an input bit range from 66 to 236 bits. To adjust the bit length, we have to place and route the ECM system with modified generics increasing or decreasing the internal size of the output buffers of the elliptic curve processor without changing the rest of the architecture. Afterwards, we generate a new instruction sequence and load it into the instruction ROM.
TABLE I: Implementation results of our ECM system factor ing integers up to 151 bit.
Table I shows the results for the block size b = 10, factoring integers up to 151 bit in 4.73 ms (phase 1) and 5.12 ms (phase 2). This corresponds to 5,064 computations per second for the first phase and 2,424 for phase 1+2.
We compare our result to the Virtex4 implementation by Gaj et al. and de Meulenaer et al. in Table II. For a fair comparison, we scaled our design to support the same bit lengths as reported in [9], [10]. For a consistent pricing model, we refer to the official Xilinx distributor Avnet Electronics Marketing4. The information are as of January, 2010 for single purchase without discount, rounded up to the next dollar.
Note that comparing to the architecture proposed by Gaj et al. is possible since performance figures for same platform (Xilinx Virtex4) and scope of implementation (ECM phase 1 and phase 2) are available. However, we like to emphasize that this does not work for the design by de Meulenaer et al. The latter design only implements phase 1 of the ECM what provides a higher performance at the cost of a reduced success probability to find a prime factor of the composite integer. Hence, this design is listed for completeness but does not allow an expressive comparison.
Considering the reported work by Gaj et al. our design provides the significant better costperformance ratio by a factor of 37 for ECM phase 1+2 on the same class of Virtex4 FPGAs. This ratio is likely to be improved when using cost optimized Spartan6 devices instead of Virtex4 FPGAs.
4see http://em.avnet.com
8
8
cin
[A:B]
DSP2: Reduction
48
48
8
8
cin
[A:B] ±
48
DSP1: Calculation
M[33..18]Æ A 1
1
ÆA1
A[17..0] B[17..0]
S [34]
�
[34]
R
M[17..0]Æ B
A[33..18] B[33..18]
FPGA Type Number of Cores
Virtex4 SX 35 24
Arithmetic Clock Frequency # Cycles (Phase I)
# Cycles (Phase II)
# Cycles (Phase I + II) Time (Phase I)
Time (Phase II)
200 MHz 945,746 1,024,029 1,969,775 4.73 ms 5.12 ms
Phase I per Second Phase I + II per Second
5,064 2,424
ÆB1
Fig. 4: DSP configuration of the modular addition/subtraction unit.
Figure 4 depicts the DSP configuration for the addi tion/subtraction unit, consisting of two pipelined and cascaded DSP cores. Each core supports the concatenation of the (signed) 18 bit inputs A and B. We use this to load two 17bit blocks of the first value Aj into the result register and use the feedback path to add two 17bit blocks of the second value Bj . This requires a dynamic operation mode configuration, as the DSP alternates between loading and adding the feedback path3 .
After processing the block j, the first DSP passes the intermediate result Rj = Aj ± Bj(+ CIN) to the second DSP for reduction Sj = Rj ∓ Mj (+ CIN). Both operations work in parallel, only shifted by one clock cycle. At the end of the computation, the unit evaluates the borrow bit of the subtraction (either by the main or reduction DSP) and determines the correct output, i.e., R or S, respectively.
As we decided to feed results from a modular multiplication directly to this unit, we have to process n blocks with 17bit each and subsequently reduce them by 2Mˇ . With respect to the pipelining registers, this leads to a total of 2n + 3 clock cycles to compute and return the result.
IV. RESULTS
We present the results of our implementation based on Xil inx ISE Foundation 11 for synthesis and implementation. All
3Note that a static operation mode concept is possible and requires less logic, but does not work on Virtex4 FPGAs very effectively. On Virtex4 FPGAs, two DSP cores share one input port. This restriction does not exist for other FPGA families, such as Virtex5 or Virtex6.
8
8
8
7
7
7
Gaj et al.
This work
Factor
de Meulenaer et al.
This work
Factor
FPGA Device FPGA Cost Supported Bits
V4LX20011 7,564 $ 198
V4SX3510 468 $ 202
0.06 1.02
V4SX2510 298 $ 135
V4SX3510 468 $ 134
1.57 0.99
Max. ECM Cores Max. Frequency
24 104 MHz
24 200 MHz
1.00 1.92
1 220 MHz
24 200 MHz
24 0.91
Cycles for Addition Cycles for Multiplication Cycles for Phase 1
Time for Phase 1
41 216 1,666,500 16 ms
29 201 1,473,596 7.37 ms
0.71 0.93 0.88 0.46
1
1 13,750 63 μs
21 105 797,288 3.99 ms
21 105 59.98 63.78
# Phase 1 per Second
# Phase 1 per Second per 100 $
# Phase 1+2 per Second
# Phase 1+1 per Second per 100 $
1,448 19 696 9
3,240 692 1,560 333
2.24 36.42 2.24 37
16,000* 5,369* 

6,000 1,282 2,880 615
0.375 0.239 

TABLE II: Comparison of our results using b = 13 (L = 202 bit) and b = 9 (L = 134 bit) with Virtex4 LX 200 (Gaj et al.) and
Virtex4 SX 25 (de Meulenaer et al.) implementations. Note ECM phase 1 with reduced success probability.
V. CONCLUSION
In this work, we presented an implementation of phase 1 and 2 of the Elliptic Curve Method for integer factorization. Our efficient design allows to run up to 24 parallel ECM cores on a single Virtex4 SX35 FPGA and can be scaled to factor 66 to 236 bit integers without considerable hardware changes. Picking up the ideas by Šimka et al. and Gaj et al. we implemented the underlying arithmetic function using the support of DSP blocks and used additional optimizations such as an advanced memory management and improved addition chains for point multiplication.
For 151bit integers, our implementation can provide 2,424 factorizations per second per FPGA running both phases of the ECM. Given an interface fast enough to provide the I/O data, our implementation scales to a total of 310,272 factorizations per second on a fully equipped COPACOBANA cluster populated with 128 Virtex4 devices. Compared to the implementation by Gaj et al. who reported a similar design for the same Virtex4 platform, our architecture provides a costperformance ratio which is better by a factor of 37.
ACKNOWLEDGEMENTS
The work described in this paper has been supported by the Federal Office for Information Security (BSI), Germany, and the European Commission through the ICT program under contract ICT2007216676 ECRYPT II.
REFERENCES
[1] H. Lenstra, "Factoring Integers with Elliptic Curves," Annals of Math ematics, vol. 126, pp. 649–673, 1987.
[2] J. M. Pollard, "Theorems on factorization and primality testing," in Proceedings of the Cambridge Philosophy Society, 1974, pp. 521–528.
[3] H. C. Williams, "A p + 1 method of factoring," Mathematics of
Computation, vol. 39, pp. 225–234, 1982.
[4] A. K. Lenstra and H. W. J. Lenstra, Eds., The Development of the
Number Field Sieve, ser. LNM. SpringerVerlag, 1993, vol. 1554.
[5] P. Zimmermann, A. Kruppa, B. Gladman, J. Papadopoulos, J. Feltin, L. Fousse, P. Gaudry, R. Cosset, and T. Kleinjung, "GMPECM Web
site," June 2010, available at https://gforge.inria.fr/projects/ecm/.
[6] D. J. Bernstein, T.R. Chen, C.M. Cheng, T. Lange, and B.Y. Yang, "ECM on graphics cards," in Eurocrypt 2009, ser. LNCS, vol. 5479,
that figures with asterisk solely represent a design implementing
[7] M. Šimka, J. Pelzl, T. Kleinjung, J. Franke, C. Priplata, C. Stahlke, M. Drutarovský, V. Fischer, and C. Paar, "Hardware Factorization Based on Elliptic Curve Method," in Proceedings of the IEEE Symposium on FieldProgrammable Custom Computing Machines (FCCM 2005), J. Arnold and K. L. Pocek, Eds. IEEE Computer Society, April 1820 2005, pp. 107–116.
[8] K. Gaj, S. Kwon, P. Baier, P. Kohlbrenner, H. Le, M. Khaleeluddin, and R. Bachimanchi, "Implementing the Elliptic Curve Method of Factoring in Reconfigurable Hardware," in Proceedings of the Workshop on Cryptograpic Hardware and Embedded Systems (CHES 2006), ser. LNCS, L. Goubin and M. Matsui, Eds., vol. 4249. SpringerVerlag, 2006, pp. 119–133.
[9] K. Gaj, S. Kwon, P. Baier, P. Kohlbrenner, H. Le, M. Khaleeluddin, R. Bachimanchi, and M. Rogawski, "Areatime efficient implementation of the elliptic curve method of factoring in reconfigurable hardware for application in the number field sieve," IEEE Transactions on Computers, vol. 99, no. PrePrints, 2009.
[10] G. de Meulenaer, F. Gosset, M. M. de Dormale, and J.J. Quisqater, "Integer Factorization Based on Elliptic Curve Method: Towards Bet ter Exploitation of Reconfigurable Hardware," in Proceedings of the IEEE Symposium on FieldProgrammable Custom Computing Machines (FCCM 2007). IEEE Computer Society, 2007, pp. 197–206.
[11] T. Güneysu, T. Kasper, M. Novotný, C. Paar, and A. Rupp, "Cryptanal ysis with COPACOBANA," IEEE Transactions on Computers, vol. 57, no. 11, pp. 1498–1513, November 2008.
[12] T. Güneysu, C. Paar, G. Pfeiffer, and M. Schimmler, "Enhancing COPA COBANA for advanced applications in cryptography and cryptanalysis," in Proceedings of the Conference on Field Programmable Logic and Applications (FPL 2008), 2008, pp. 675–678.
[13] A. J. Menezes, P. C. van Oorschot, and S. A. Vanstone, Handbook of Applied Cryptography. New York: CRC Press, 1996.
[14] R. P. Brent, "Some Integer Factorization Algorithms Using Elliptic Curves," Australian Computer Science Communications, vol. 8, pp. 149– 163, 1986.
[15] P. L. Montgomery, "Speeding the Pollard and elliptic curve methods of factorization," Mathematics of Computation, vol. 48, no. 177, pp. 243–264, 1987.
[16] H. M. Edwards, "A normal form for elliptic curves," Bulletin American Mathematical Society, vol. 44, no. 3, p. 393, 2007, http://www.ams.org/bull/2007 44 03/S0273 0979 07 01153 6/
S0273 0979 07 01153 6.pdf.
[17] P. L. Montgomery, "Modular Multiplication Without Trial Division," Mathematics of Computation, vol. 44, no. 170, pp. 519–521, April 1985.
[18] H. Orup, "Simplifying quotient determination in highradix modular multiplication," in ARITH '95: Proceedings of the 12th Symposium on
2009, pp. 483–501.
8
8
8
8
8
Computer Arithmetic. 1995, p. 193.
Washington, DC, USA: IEEE Computer Society,
8
Results Timing Design
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.187.7244&rep=rep1&type=pdf
HighPerformance Integer Factoring with Reconfigurable Devices
Ralf Zimmermann, Tim Güneysu and Christof Paar
Horst Görtz Institute for ITSecurity, RuhrUniversity Bochum, Germany Email: {zimmermann,gueneysu,cpaar}@crypto.rub.de
Abstract—We present a novel FPGAbased implementation of the Elliptic Curve Method (ECM) for the factorization of mediumsized composite integers. More precisely, we demon strate an ECM implementation capable to determine prime factors of up to 2,424 151bit integers per second using a single Xilinx Virtex4 SX35 FPGA. Using this implementation on a cluster like the COPACOBANA is beneficial for attacking cryptographic primitives like the wellknown RSA cryptosystem with advanced methods such as the Number Field Sieve (NFS).
To provide this vast number of integer factorizations per FPGA, we make use of the available DSP blocks on each Virtex 4 device to accelerate lowlevel arithmetic computations. This methodology allows the development of a timearea efficient design that runs 24 ECM cores in parallel, implementing both phase 1 and phase 2 of the ECM. Moreover, our design is fully scalable and supports composite integers in the range from 66 to 236 bits without any significant modifications to the hardware. Compared to the implementation by Gaj et al., who reported an ECM design for the same Virtex4 platform, our improved architecture provides an advanced costperformance ratio which is better by a factor of 37.
Index Terms—Factorization, elliptic curve method, reconfig urable hardware, COPACOBANA.
I. INTRODUCTION
In 1987, the Elliptic Curve Method (ECM) was introduced by H. W. Lenstra [1] as a new method for integer factorization, generalizing the concept of Pollard's p−1 and Williams' p+1 method [2], [3]. Although the ECM is known not to be the fastest method for factorization with respect to asymptotical time complexity, it is widely used to factor composite numbers up to 200 bits due to its very limited requirements on memory.
The most prominent application that relies on the hardness of the factorization problem is the RSA cryptosystem. An attacker on RSA has to find the factorization of a composite number n which consists of two large primes p,q. More precisely, the RSA security parameter n is larger than 1024 bits and hence out of reach of the ECM. Up to date, such large bit sizes are preferably attacked with the most powerful methods known so far, such as the Number Field Sieve (NFS). However, the complex NFS1 involves the search of relations in which many midsized numbers need to be tested if they are "smooth", i.e., composed only of small prime factors not
1The NFS comprises of four steps, the polynomial selection, relation finding, a linear algebra step and finally the square root step. The relation finding step is most timeconsuming, taking roughly 90% of the runtime. For more information on the NFS refer to [4].
9780769541792/10 $26.00 © 2010 IEEE DOI 10.1109/FPL.2010.26
larger than a fixed boundary B. In this context, ECM is an important tool to determine the smoothness of such integers (i.e., if they can be factored into small primes), in particular due to its moderate resource requirements.
The fastest ECM implementations for retrieving factors of composite integers are softwarebased; a stateoftheart system is the GMPECM software published by P. Zimmer mann et al. [5] and has been extended for use with GPUs by Bernstein et al. [6]. As a promising alternative, efficient hardware implementations of the ECM were first proposed in 2005: Šimka et al. [7] demonstrated the feasibility to imple ment the ECM in reconfigurable hardware by presenting a first proofofconcept implementation. Their results were improved by Gaj et al. [8], [9], who also showed a complete hardware implementation of ECM phase 2. However, the lowlevel arith metic in these implementations were only implemented using straightforward techniques within the configurable logic which yet leaves room for further improvements. To fill this gap, de Meulenaer et al. [10] proposed an unrolled Montgomery multiplier based on a twodimensional pipeline on Xilinx Virtex4 FPGAs to accelerate the field arithmetic. However, due to limitations in area and the long pipeline design, their design only efficiently supports the first phase of the ECM.
Contribution: In this work we propose a novel ECM archi tecture for Xilinx Virtex FPGAs making use of DSP blocks for the computationally intensive arithmetic. Our focus is to accelerate the underlying field arithmetic of the ECM on FPGAs without sacrificing the option to combine both phase 1 and 2 in a single core. Thus, we adopt some highlevel decisions like memorymanagement and the use of SIMD instructions from [8] which also supports both phases on the same hardware. To improve the field arithmetic, we place fundamental arithmetic functions like adders and multipliers in embedded DSP blocks of modern FPGAs. For factoring large amounts of numbers, we finally describe our factor ization setup based on a variant of COPACOBANA (Cost Optimized PArallel COde Breaker)  a cluster system based on FPGAs [11], [12].
Outline: We start with a short review on the mathematical background and the concept of the ECM. In Section III we first describe the cluster system COPACOBANA, which represents the target platform of our work, and then discuss the architecture of an ECM core and its corresponding arithmetic components. Finally, we present our factorization results in Section IV before we conclude with Section V.
83
II. MATHEMATICAL BACKGROUND
We start with a brief introduction of the p − 1 method for
factorization to motivate the concept of the ECM. Let k ∈ N and n be the composite to be factored. Furthermore, let pn with p ∈ P, a ∈ Z and n be coprime, i.e., gcd(a,n) = 1. Now take Fermat's little Theorem [13, Fact 2.127] with ap−1 ≡ 1 mod p. The extension by the kmultiple of (p − 1) leading to ak(p−1) ≡ 1 mod p holds as well since (ap−1)k ≡ 1k =1modp. Then, with e=k(p−1) we have
ak(p−1) =ae ≡1modp ⇒ ae−1≡0modp
⇒ p(ae−1)
Hence, if ae ̸=1modn we know
1 < gcd(ae −1,n) < n.
In this case, we found a nontrivial divisor of n. However, we are not able to compute e = k(p − 1) without knowledge of p. Thus, we assume that p − 1 decomposes solely into small prime factors pi less than a defined bound B1 (in this case, p−1 is called B1smooth). Now we choose e as product of all prime powers pri lower than B1 and hope that e is a multiple of p − 1:
∏ ⌊logp (B1)⌋
e= pi (1)
i
pi∈P;pi<B1
By computing d = gcd(ae − 1, n) we finally hope to find a divisor d of n. Now, we transfer this idea to elliptic curves E. Let us assume q to be a factor of n and E(Zq) is B1smooth so that e, according to the construction in Equation (1), is a multiple of q − 1. Note that point multiplication by E(Zq) (or multiples of E(Zq)) returns the point at infinity, e.g., Q = eP = O for an arbitrary point P and resulting point Q. Recall that the resulting point Q = O implies a prior impossible division by zQ with zQ ≡ 0 mod q. Note that we actually perform all point operations in E(Zn) since we do not know q. Hence, we compute Q = eP in E(Zn) and hope to yield a point Q with coordinate zQ ̸= 0 mod n but zQ ≡ 0modq. Then, the factor q of n is obtained by q = gcd(zQ,n).
From an algorithmic point of view, we can discover a factor q of n as follows: in the first phase of the ECM, we compute Q = eP, where e is a product of prime powers pi ≤ B1 with appropriately chosen smoothness bounds. The second phase increases the success probability by checking for each prime B1 < p ≤ B2 whether pQ reduces to the neutral element in E(Zq). This second phase increases the chance to find a factor of n as shown by [14]. Algorithm 1 summarizes all necessary steps of both phases of the ECM.
If we consider a single curve only, the properties of the method are closely related to those of Pollard's (p−1)method that can fail by returning a trivial divisor, such as 1 or n. The advantage of ECM is apparent with the possibility of choosing another curve (and thus group order) after each unsuccessful
Algorithm 1 The Elliptic Curve Method
Input: Compositen=f1·f2·...···fn. Output: Factor fi of n.
8
8
8
4
4
4
1: 2: 3:
4: 5: 6: 7: 8: 9:
10: 11: 12: 13: 14: 15: 16: 17: 18: 19:
Phase 1:
Choose arbitrary curve E(Zn) and random point P ∈ E(Zn) ̸= O. Choose smoothness bounds B1, B2 ∈ N.
Compute e = ∏ p logpi (B1) pi∈P;pi<B1 i
Compute Q = eP = (xQ; yQ; zQ). Compute d = gcd(zQ, n).
Phase 2:
Sets:=1. foreachprimepwithB1<p≤B2do
Compute pQ = (xpQ : ypQ : zpQ).
Compute d = d · zpQ. end for
Compute fi = gcd(d, n). if1<fi <nthen
A nontrivial factor d is found.
return fi else
Restart from Step 2 in Phase 1.
end if
⌊⌋
trial, increasing the probability of retrieving factors of n. If the final gcd of the product s and n satisfies 1 < gcd(s, n) < n, a factor is found. Note that the parameters B1 and B2 control the probability of finding a divisor q. More precisely, if the order of P factors into a product of prime powers (each ≤ B1) and at most one additional prime between B1 and B2, the prime factor q is discovered.
Note that in the last years, most designs for the ECM preferably use elliptic curves in Montgomery or Edwards form [15], [16] since it allows to compute points with simpli fied formulas, e.g., by only using the x and zcoordinates. In this work, we will use Montgomery curves since they allow highly regular computation patterns which are favorable for hardware circuits.
III. DESIGNING THE ECM SYSTEM
To factor a large amount of midsized numbers as required by the NFS, we chose COPACOBANA as the target plat form for our ECM implementation. COPACOBANA [11] is a parallel lowcost cluster system that can be populated with up to 128 FPGAs in a single housing. More precisely, we modified the COPACOBANA architecture for this work to host Virtex4 FPGA devices distributed among 16 plugin cards which are all connected to single backplane. Each Virtex4 SX device provides 192 DSP blocks which can be cascaded to support pipelined, arithmetic intensive computations as shown in Section IIIB. Note that latest Spartan6 FPGAs (which also come with a large number of DSP blocks) are another promising alternative due to their costperformance ratio as target device. However, they are currently not available in sufficiently large quantities.
A. Architecture of an ECM Core
We will now describe the layout of our implementation. As it adopts some of the design choices by Gaj et al., we will begin with a short summary of the basic system and then outline our improvements.
Instruction ROM
17
Workspace Memory
17
17
17
MMUL1 Input
17 1
MulMod Input
MMUL2 Input
71
7
17
MADDSUB
MMUL1
7
MMUL2
1
System
24
ECM Core
ECM Core
ECM Core
ECM Core
ECM Core
ECM Core
ECM Core
Core
ECM Core
ECM Core
ECM Core
100 MHz (SYSCLK)
ECM Core
ECM Core
200 MHz (ECCCLK)
ECM Core
Control Unit
ctrl
ctrl
Elliptic Curve Processor
Synch RAM
ECM Core
CtrlUnit
Synch RAM
Synch RAM
based ECP, as shown in Figure 2, for improved performance. The processor contains a large instruction ROM, controlling four dedicated memory blocks and three underlying arithmetic units. It implements the sequences for point addition, point doubling, one step of the Montgomery Ladder [15] as well as optimized instructions to process the second phase.
To implement the second phase of the ECM, we chose the standard continuation method with a different precomputation and memory layout to that used in [8]. To continue with the previous point result from phase 1 and additional primes, we first need to store these primes in a primetable. Furthermore, we use a timememory tradeoff (dependent on a parameter D = 210) to store differences between individual primes represented using an interval scanning technique. For this, we calculate a set of 24 fixed multiples of the resulting point Q0 as well as the boundary points DQ0 and MMIN(B1)DQ0 (MMIN(B1) ≥ 2). As the parameter D is fixed for a given implementation, we computed an addition chain2 which returns all points for the chosen B1. Using the suggested boundary B1 = 960, our addition chain needs to perform 30 point additions and doubles 7 times instead of 1 point doubling and 52 point additions as reported in [8].
The computation suggested by Gaj et al. for phase 2 uses two multiplication and one addition/subtraction unit to perform subsequently three modular multiplications and corresponding additions per round after an initial delay of one leading multi plication. We remark that the initial delay exists in every round (and thus MM AX − MM I N + 1 times in total). Additionally, in case if this is an odd number, the final computation requires special handling since only one point is used instead of two. Algorithm 2 shows an improved version of the standard continuation which provides remedy to both issues and only requires a single initial multiplication per run.
We work on the variables r, s and d and delay the computation of r − s to the next iteration. Thus, we require an initialization (line 4) using the first prime we found in the primetable. Afterwards, these variables are updated in each iteration. For round m, we find two distinct points aQ0 and bQ0 and process these subsequent values (lines 11 to 14). At the end of the iteration, we renew the variables d, r and s. As
2Note that the chain must ensure that for any point addition R+Q we have already calculated (and stored) the corresponding difference point R − Q.
Fig. 2: Layout of the Elliptic Curve Processor
Fig. 1: Layout of the ECM system and ECM core
Figure 1 shows the components of our design: It consists of a buffered system bus which is fed by a data request scheduler which manages the I/O communication and timing dependencies and a fixed number ECM cores. The COPA COBANA is connected to a host PC which takes care of the necessary pre and postprocessing. Depending on the size of the communication interface (not included in the figure), up to 24 ECM cores can be placed on a Virtex4 SX35 FPGA with the number of DSPs as the limiting factor.
Previous versions of COPACOBANA did not require a fast communication interface since applications performing an ex haustive key search rarely exchange data. However, parameters and results for ECM operations need to be transferred between the cluster and the hostPC. In particular, some operations like gcd computations and the generation of elliptic curves and corresponding parameters are costly in hardware and are performed on the host system.
In the lower part of Figure 1, we outline an ECM core in detail. It consists of a control unit, the Elliptic Curve Processor (ECP) and a memory buffer to synchronize between clock domains. Each core is divided into two clock domains: The control unit operates at a 100 MHz clock, controlling the ECP and setting up the necessary computation parameters, while the ECP itself runs at a 200 MHz clock performing the arithmetic intensive operations. In order to cross these domains (i.e., for receiving the ECM parameters or sending the results) a dualported memory buffer is used, each port operating at the respective clock speed.
As our FPGA has a total of 192 memory modules, each ECM core contains a local memory storing parameters and instructions for phase 1 and 2 (control unit) and each ECP contains a dedicated instruction ROM. This redundancy is beneficial to reduce the critical path length in order to enable the fast ECP clock domain.
Gaj et al. suggested an efficient way to implement both phases of the ECM using two modular multiplication and one modular addition/subtraction unit (cf. [8]). We adopt this proposal for an highlevel abstraction but implement a DSP
8
8
8
5
5
5
Data Request Scheduler
Algorithm 2 Computation of d in Phase 2
Input: Bit table primetable and points R = (Rx :: Rz) = MMINDQ0, Q = DQ0 and jQ0
Output: Output d of second phase
1 AÆA
1 4
8
8 8
DSP2: PCIN + b(i) * A(j)
X+
P 48
48
1: 2: 3: 4: 5: 6: 7: 8: 9:
10: 11: 12: 13: 14: 15: 16: 17: 18: 19: 20: 21: 22: 23: 24: 25: 26:
m ← MMIN , pt ← primetablem i ← findNextPosition(pt) A←jiQ0 r←AzRx,s←AxRz,d←1 while m ≤ MMAX do
i ← findNextPosition(pt) if i ̸= END then
A ← jiQ0
i ← findNextPosition(pt) if i ̸= END then
B ← jiQ0 t←AzRx,u←AxRz,d1 ←r−s d ̃ ← d · d 1 , r ← B z R x , d 2 ← t − u d ← d ̃ · d 2 , s ← B x R z
else
r←AzRx,s←AxRz,d1 ←r−s d ← d · d1
end if end if
if i = END then
m ← m + 1, pt ← primetablem R←R+Q
end if end while
d1 ←r−s d←d·d1
BÆB SÆ
8
8
DSP1: S(i, j) + q(i) * M(j)
X+
soon as we reach the end of primetablem, we have one of two cases: either we found only aQ0, or no point since the last iteration. We cannot process aQ0 with the first prime found in primetablem+1, because we need to update the point R for the next iteration. Therefore, we update the variables r, s and d using two multiplications (lines 16 and 17). The second case does not provide a problem and requires no special handling. In both cases we increase m and modify R according to the basic algorithm. When we reach m = MM AX , we process the values remaining in r, s and return d (lines 25 and 26).
B. Modular Arithmetic Units
In this section we discuss the implementation of modular arithmetic units for elliptic point computations relying on the fast arithmetic using DSP blocks.
1) Modular Multiplication Unit: The modular multiplica tion A · B mod M implemented in this work is a variant of the modular multiplication algorithm without trial division proposed by Montgomery [17]. Orup published several mod ifications for this method to simplify the quotient handling in [18]. Straightforward quotient handling is particularly impor tant as it allows consecutive arithmetic operations in the DSP blocks which require a rather fixed realignment of operands. In this context, Orup's improvements only demand for word wise (i.e., kbit) multiplications, additions and shifting which are natively supported by the DSP blocks of Virtex4 FPGAs. Hence, using Orup's variant Modular Multiplication with Quo tient Pipelining, all arithmetic operations can be performed by sequential instructions in DSP blocks without need of additional resources in the configurable logic.
MÆA1
QÆB1
Fig. 3: DSP configuration of the modular multiplication unit We will now shortly revise Orup's modification to the
Montgomery multiplication. It needs n + 2 + d steps for a fixed delay parameter d. We set the pipelining parameter d = 0 and simplify the algorithm as shown in Algorithm 3.
Algorithm 3 Modular Multiplication with Quotient Pipelining
and d = 0
Input: Modulus M > 2 with gcd(M,2) = 1, k,n ∈ N : 4M ̃ < 2kn. Mˇ := (M ̃ + 1) div 2k, M ̃ := (M′ mod 2k)M, with (−MM′) mod 2k = 1. Integers A,B in Montgomery Form with 2knR−1 ≡ 1 modMand0≤A≤2M ̃,0≤B≤2M ̃.bi referstotheithkbit block of B.
Output: Sn+2 ≡ ABR−1(mod M) and 0 ≤ Sn+2 < 2Mˇ S0 := 0
for i := 0 to n do
qi := Si mod 2k
Si+1 := Si div 2k + qi · Mˇ + biA;
end for
Sn+2 := 2kSn+1
The definition of Mˇ represents the relation between the number of blocks n and the size of the modulus M: M < 2k(n−1)−2. To use dedicated DSP cores, we use the static word size of k = 17. In turn, if we implement n = 10, we allocate 170 bits for each intermediate result to perform computations on a 151bit modulus. This overhead is required because the algorithm does not compute the final reduction to save time.
Our implementation uses a fixed pipeline of three DSP cores to sequentially calculate the result. Using a sequential approach that processes individual word segments provides several advantages for lowlevel optimizations, e.g., the effi cient use of counters to address input and output memory.
Figure 3 shows the configuration of these three DSP cores. The index i denotes the round while the index j represents the current 17bit block of the input. Note that an additional feedback path which connects the result block to the qi input is not included in the figure for the sake of clarity. Our multiplier needs a total of n(n+2)+6 clock cycles to compute and return the result.
2) Modular Addition/Subtraction Unit: The modular addi tion/subtraction A+B mod M is no critical operation in terms of clock cycles. However, the proposed formulae for elliptic
8
8
8
6
6
6
+
A 18 [35]
DSP3: Shift(P) + (A:B + C)
B 18 [34..17]
(shared) C [16..0] 48
[A:B]
48 P
(shared) C
curve computations require that two modular additions must not exceed the time of one modular multiplication. Again, we use the arithmetic power of DSP blocks for implementing the modular addition, significantly reducing the required amount of configurable logic resources.
results are based on ECM parameters B1 = 960, B2 =57,000, D = 210 and the fixed scalar e of 1,323 bit. These parameters were used in most of the related work and allow a fair comparison.
Our ECM system is designed to allow straightforward scalability and to factor composite integers with smaller or larger bit sizes. It supports bit length of L = 17(b−1)−2 with b = 5..15 without notably hardware changes. This corresponds to an input bit range from 66 to 236 bits. To adjust the bit length, we have to place and route the ECM system with modified generics increasing or decreasing the internal size of the output buffers of the elliptic curve processor without changing the rest of the architecture. Afterwards, we generate a new instruction sequence and load it into the instruction ROM.
TABLE I: Implementation results of our ECM system factor ing integers up to 151 bit.
Table I shows the results for the block size b = 10, factoring integers up to 151 bit in 4.73 ms (phase 1) and 5.12 ms (phase 2). This corresponds to 5,064 computations per second for the first phase and 2,424 for phase 1+2.
We compare our result to the Virtex4 implementation by Gaj et al. and de Meulenaer et al. in Table II. For a fair comparison, we scaled our design to support the same bit lengths as reported in [9], [10]. For a consistent pricing model, we refer to the official Xilinx distributor Avnet Electronics Marketing4. The information are as of January, 2010 for single purchase without discount, rounded up to the next dollar.
Note that comparing to the architecture proposed by Gaj et al. is possible since performance figures for same platform (Xilinx Virtex4) and scope of implementation (ECM phase 1 and phase 2) are available. However, we like to emphasize that this does not work for the design by de Meulenaer et al. The latter design only implements phase 1 of the ECM what provides a higher performance at the cost of a reduced success probability to find a prime factor of the composite integer. Hence, this design is listed for completeness but does not allow an expressive comparison.
Considering the reported work by Gaj et al. our design provides the significant better costperformance ratio by a factor of 37 for ECM phase 1+2 on the same class of Virtex4 FPGAs. This ratio is likely to be improved when using cost optimized Spartan6 devices instead of Virtex4 FPGAs.
4see http://em.avnet.com
8
8
cin
[A:B]
DSP2: Reduction
48
48
8
8
cin
[A:B] ±
48
DSP1: Calculation
M[33..18]Æ A 1
1
ÆA1
A[17..0] B[17..0]
S [34]
�
[34]
R
M[17..0]Æ B
A[33..18] B[33..18]
FPGA Type Number of Cores
Virtex4 SX 35 24
Arithmetic Clock Frequency # Cycles (Phase I)
# Cycles (Phase II)
# Cycles (Phase I + II) Time (Phase I)
Time (Phase II)
200 MHz 945,746 1,024,029 1,969,775 4.73 ms 5.12 ms
Phase I per Second Phase I + II per Second
5,064 2,424
ÆB1
Fig. 4: DSP configuration of the modular addition/subtraction unit.
Figure 4 depicts the DSP configuration for the addi tion/subtraction unit, consisting of two pipelined and cascaded DSP cores. Each core supports the concatenation of the (signed) 18 bit inputs A and B. We use this to load two 17bit blocks of the first value Aj into the result register and use the feedback path to add two 17bit blocks of the second value Bj . This requires a dynamic operation mode configuration, as the DSP alternates between loading and adding the feedback path3 .
After processing the block j, the first DSP passes the intermediate result Rj = Aj ± Bj(+ CIN) to the second DSP for reduction Sj = Rj ∓ Mj (+ CIN). Both operations work in parallel, only shifted by one clock cycle. At the end of the computation, the unit evaluates the borrow bit of the subtraction (either by the main or reduction DSP) and determines the correct output, i.e., R or S, respectively.
As we decided to feed results from a modular multiplication directly to this unit, we have to process n blocks with 17bit each and subsequently reduce them by 2Mˇ . With respect to the pipelining registers, this leads to a total of 2n + 3 clock cycles to compute and return the result.
IV. RESULTS
We present the results of our implementation based on Xil inx ISE Foundation 11 for synthesis and implementation. All
3Note that a static operation mode concept is possible and requires less logic, but does not work on Virtex4 FPGAs very effectively. On Virtex4 FPGAs, two DSP cores share one input port. This restriction does not exist for other FPGA families, such as Virtex5 or Virtex6.
8
8
8
7
7
7
Gaj et al.
This work
Factor
de Meulenaer et al.
This work
Factor
FPGA Device FPGA Cost Supported Bits
V4LX20011 7,564 $ 198
V4SX3510 468 $ 202
0.06 1.02
V4SX2510 298 $ 135
V4SX3510 468 $ 134
1.57 0.99
Max. ECM Cores Max. Frequency
24 104 MHz
24 200 MHz
1.00 1.92
1 220 MHz
24 200 MHz
24 0.91
Cycles for Addition Cycles for Multiplication Cycles for Phase 1
Time for Phase 1
41 216 1,666,500 16 ms
29 201 1,473,596 7.37 ms
0.71 0.93 0.88 0.46
1
1 13,750 63 μs
21 105 797,288 3.99 ms
21 105 59.98 63.78
# Phase 1 per Second
# Phase 1 per Second per 100 $
# Phase 1+2 per Second
# Phase 1+1 per Second per 100 $
1,448 19 696 9
3,240 692 1,560 333
2.24 36.42 2.24 37
16,000* 5,369* 

6,000 1,282 2,880 615
0.375 0.239 

TABLE II: Comparison of our results using b = 13 (L = 202 bit) and b = 9 (L = 134 bit) with Virtex4 LX 200 (Gaj et al.) and
Virtex4 SX 25 (de Meulenaer et al.) implementations. Note ECM phase 1 with reduced success probability.
V. CONCLUSION
In this work, we presented an implementation of phase 1 and 2 of the Elliptic Curve Method for integer factorization. Our efficient design allows to run up to 24 parallel ECM cores on a single Virtex4 SX35 FPGA and can be scaled to factor 66 to 236 bit integers without considerable hardware changes. Picking up the ideas by Šimka et al. and Gaj et al. we implemented the underlying arithmetic function using the support of DSP blocks and used additional optimizations such as an advanced memory management and improved addition chains for point multiplication.
For 151bit integers, our implementation can provide 2,424 factorizations per second per FPGA running both phases of the ECM. Given an interface fast enough to provide the I/O data, our implementation scales to a total of 310,272 factorizations per second on a fully equipped COPACOBANA cluster populated with 128 Virtex4 devices. Compared to the implementation by Gaj et al. who reported a similar design for the same Virtex4 platform, our architecture provides a costperformance ratio which is better by a factor of 37.
ACKNOWLEDGEMENTS
The work described in this paper has been supported by the Federal Office for Information Security (BSI), Germany, and the European Commission through the ICT program under contract ICT2007216676 ECRYPT II.
REFERENCES
[1] H. Lenstra, "Factoring Integers with Elliptic Curves," Annals of Math ematics, vol. 126, pp. 649–673, 1987.
[2] J. M. Pollard, "Theorems on factorization and primality testing," in Proceedings of the Cambridge Philosophy Society, 1974, pp. 521–528.
[3] H. C. Williams, "A p + 1 method of factoring," Mathematics of
Computation, vol. 39, pp. 225–234, 1982.
[4] A. K. Lenstra and H. W. J. Lenstra, Eds., The Development of the
Number Field Sieve, ser. LNM. SpringerVerlag, 1993, vol. 1554.
[5] P. Zimmermann, A. Kruppa, B. Gladman, J. Papadopoulos, J. Feltin, L. Fousse, P. Gaudry, R. Cosset, and T. Kleinjung, "GMPECM Web
site," June 2010, available at https://gforge.inria.fr/projects/ecm/.
[6] D. J. Bernstein, T.R. Chen, C.M. Cheng, T. Lange, and B.Y. Yang, "ECM on graphics cards," in Eurocrypt 2009, ser. LNCS, vol. 5479,
that figures with asterisk solely represent a design implementing
[7] M. Šimka, J. Pelzl, T. Kleinjung, J. Franke, C. Priplata, C. Stahlke, M. Drutarovský, V. Fischer, and C. Paar, "Hardware Factorization Based on Elliptic Curve Method," in Proceedings of the IEEE Symposium on FieldProgrammable Custom Computing Machines (FCCM 2005), J. Arnold and K. L. Pocek, Eds. IEEE Computer Society, April 1820 2005, pp. 107–116.
[8] K. Gaj, S. Kwon, P. Baier, P. Kohlbrenner, H. Le, M. Khaleeluddin, and R. Bachimanchi, "Implementing the Elliptic Curve Method of Factoring in Reconfigurable Hardware," in Proceedings of the Workshop on Cryptograpic Hardware and Embedded Systems (CHES 2006), ser. LNCS, L. Goubin and M. Matsui, Eds., vol. 4249. SpringerVerlag, 2006, pp. 119–133.
[9] K. Gaj, S. Kwon, P. Baier, P. Kohlbrenner, H. Le, M. Khaleeluddin, R. Bachimanchi, and M. Rogawski, "Areatime efficient implementation of the elliptic curve method of factoring in reconfigurable hardware for application in the number field sieve," IEEE Transactions on Computers, vol. 99, no. PrePrints, 2009.
[10] G. de Meulenaer, F. Gosset, M. M. de Dormale, and J.J. Quisqater, "Integer Factorization Based on Elliptic Curve Method: Towards Bet ter Exploitation of Reconfigurable Hardware," in Proceedings of the IEEE Symposium on FieldProgrammable Custom Computing Machines (FCCM 2007). IEEE Computer Society, 2007, pp. 197–206.
[11] T. Güneysu, T. Kasper, M. Novotný, C. Paar, and A. Rupp, "Cryptanal ysis with COPACOBANA," IEEE Transactions on Computers, vol. 57, no. 11, pp. 1498–1513, November 2008.
[12] T. Güneysu, C. Paar, G. Pfeiffer, and M. Schimmler, "Enhancing COPA COBANA for advanced applications in cryptography and cryptanalysis," in Proceedings of the Conference on Field Programmable Logic and Applications (FPL 2008), 2008, pp. 675–678.
[13] A. J. Menezes, P. C. van Oorschot, and S. A. Vanstone, Handbook of Applied Cryptography. New York: CRC Press, 1996.
[14] R. P. Brent, "Some Integer Factorization Algorithms Using Elliptic Curves," Australian Computer Science Communications, vol. 8, pp. 149– 163, 1986.
[15] P. L. Montgomery, "Speeding the Pollard and elliptic curve methods of factorization," Mathematics of Computation, vol. 48, no. 177, pp. 243–264, 1987.
[16] H. M. Edwards, "A normal form for elliptic curves," Bulletin American Mathematical Society, vol. 44, no. 3, p. 393, 2007, http://www.ams.org/bull/2007 44 03/S0273 0979 07 01153 6/
S0273 0979 07 01153 6.pdf.
[17] P. L. Montgomery, "Modular Multiplication Without Trial Division," Mathematics of Computation, vol. 44, no. 170, pp. 519–521, April 1985.
[18] H. Orup, "Simplifying quotient determination in highradix modular multiplication," in ARITH '95: Proceedings of the 12th Symposium on
2009, pp. 483–501.
8
8
8
8
8
Computer Arithmetic. 1995, p. 193.
Washington, DC, USA: IEEE Computer Society,
8
Results Timing Design
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.187.7244&rep=rep1&type=pdf
Comments
Post a Comment