大家好,需要先读懂上一篇博客文章一文说尽32bit素数检测,打好基础,才能更好地理解这篇博客文章。
https://blog.csdn.net/songpeilin815/article/details/135181545?spm=1001.2014.3001.5501
首先编写64bit快速幂模运算的代码。
void PowerMod_U64(unsigned long long x, \
unsigned long long N, unsigned long long m, unsigned long long& R);
64bit幂模运算,逐次平方法计算x^(N)模m的非负余数。
当进行64bit乘以64bit的乘法运算的时候,运算符*是无能为力的,因为64bit乘以64bit导致128bit运算结果,而运算符*只保留了低64bit,舍弃了高64bit,这就需要调用指令集来完成。
unsigned __int64 _umul128(unsigned __int64 Multiplier, unsigned __int64 Multiplicand, unsigned __int64 *HighProduct),这个函数完成Multiplicand * Multiplier,结果分别保存在两个变量里,高64bit HighProduct在引用参数里,低64bit LowProduct作为返回结果。因为64bit乘以64bit导致128bit运算结果,那么除法就是128bit除以64bit的除法,运算符/也是不行的,仍然需要调用指令集。
unsigned __int64 _udiv128(unsigned __int64 highDividend, unsigned __int64 lowDividend, unsigned __int64 divisor, unsigned __int64 *remainder),在这个函数里highDividend是128bit被除数的高64bit,就是前面的HighProduct,低64bit被除数是lowDividend,就是前面的LowProduct,divisor是除数,remainder是运算后的余数,这个函数的返回值是商。除法运算符/和取余运算符%是不行的,它们操纵不了128bit被除数。
_umul128和_udiv128需要打开x64编译选项才行,包含相关头文件,而且必须是VS 2022版本才能编译通过
#include <intrin.h>
#include <immintrin.h>
以下是具体代码。没什么可讲的,只要大家把上一篇博客文章“一文说尽32bit素数检测”完全看懂,那么这个代码很容易理解,把_umul128想象为乘法符号*,把_udiv128想象为除法符号/就行了。
void PowerMod_U64(unsigned long long x, unsigned long long N, \
unsigned long long m, unsigned long long& R)
{
if (m <= 1)
{
if (m == 0)
{
std::cout << "除数为零,程序强行终止!\n";
exit(0);
}
else
{
R = 0;
return;
}
}
if (N <= 1)
{
if (N == 0)
{
R = 1;
return;
}
else
{
R = x % m;
return;
}
}
unsigned long long k;
unsigned long long HiPart;
unsigned long long LoPart;
x %= m;/*特别注意*/
R = 1;
if (N <= 4)
{
for (k = 0; k < N; k++)
{
LoPart = _umul128(R, x, &HiPart);
_udiv128(HiPart, LoPart, m, &R);
}
}
else
{
unsigned long Max;
unsigned long long MaskCode;
_BitScanReverse64(&Max, N);
MaskCode = (1ULL << (Max - 1));
for (k = 1; k <= MaskCode; k <<= 1)
{
if (N & k)
{
LoPart = _umul128(R, x, &HiPart);
_udiv128(HiPart, LoPart, m, &R);
}
LoPart = _umul128(x, x, &HiPart);
_udiv128(HiPart, LoPart, m, &x);
}
LoPart = _umul128(R, x, &HiPart);
_udiv128(HiPart, LoPart, m, &R);
}
}
在上面这段代码中,我写下了一条注释,_udiv128这条指令是完成128bit除以64比他的除法,商可能溢出。因为这条指令的返回值是商,余数放在引用参数中,余数是不会溢出的。关键是商,这条指令只保留64bit商,大家可以想象一下,如果被除数接近128bit满度,而除数只是2或者3这样的很小的数,那么商是不是会很大以至于超过64bit,那样就溢出了。_udiv128是做带余数除法用的,全除法得额外编写代码。x %= m,就可以避免溢出了,如果还没有听懂,看视频讲解。
下面是64bit米勒拉宾算法,一个证据数的基本算法。
bool StdMillerRabin_U64(unsigned long long Witness, \
unsigned long long Composite);
/*
★★★★★实用函数。
64bit米勒拉宾测试。
返回值:
Composite为合数:true;
Composite为素数:false(基于概率)。
*/
bool StdMillerRabin_U64(unsigned long long Witness, \
unsigned long long Composite)
{
if (Witness < 2)
{
std::cout << "Witness必须大于等于2!\n";
exit(0);
}
if (Composite <= 2)
{
if (Composite == 2)
return false;
else
return true;
}
if ((Composite & 1ULL) == 0)
return true;
if (Composite == Witness)
return false;
else
{
if (Composite % Witness == 0)
return true;
}
unsigned long long C_1 = Composite - 1;
unsigned long k;
unsigned long long q = C_1;
unsigned long long Remainder;
unsigned long long HiPart;
unsigned long long LoPart;
unsigned long n;
_BitScanForward64(&k, q);
q >>= k;
PowerMod_U64(Witness, q, Composite, Remainder);
if ((Remainder == 1) || (Remainder == C_1))
return false;
for (n = 1; n < k; n++)
{
LoPart = _umul128(Remainder, Remainder, &HiPart);
_udiv128(HiPart, LoPart, Composite, &Remainder);
if (Remainder == C_1)
return false;
}
return true;
}
一口吃不了胖子,把上一篇博客文章“一文说尽32bit素数检测”完全看懂,这个立即就能理解。
下面是64bit米勒拉宾算法,一组证据数的基本算法,循环而已。
bool ExtMillerRabin_U64(unsigned long long* pWitness, \
unsigned int Length, unsigned long long Composite);
/*
★实用函数。
64bit米勒拉宾测试。
使用一组数pWitness来判定Composite是否为合数的米勒拉宾测试算法,Length指定
证据数个数。
返回值:
Composite为合数:true;
Composite为素数:false(基于概率)。
*/
bool ExtMillerRabin_U64(unsigned long long* pWitness, \
unsigned int Length, unsigned long long Composite)
{
bool isComposite = false;
for (unsigned int k = 0; k < Length; k++)
{
isComposite = StdMillerRabin_U64(pWitness[k], Composite);
if (isComposite)
break;
}
return isComposite;
}
基本64bit素数检测代码。
bool StdPrimeQ_U64(unsigned long long P, bool UseJimSinclairWitness = true, \
bool PrintNT = false);
/*
★实用函数,标准速度4us以内(Jim Sinclair法)。
*/
bool StdPrimeQ_U64(unsigned long long P, bool UseJimSinclairWitness, \
bool PrintNT)
{
if (P <= 4294967295)
return StdPrimeQ_U32((unsigned int)P, PrintNT);
bool isPrime = false;
unsigned long long pWitness1[7] = { 2, 325, 9375, 28178, 450775, \
9780504, 1795265022 };
unsigned long long pWitness2[13] = { 2, 3, 5, 7, 11, 13, 17, 19, \
23, 29, 31, 37, 41 };
if (UseJimSinclairWitness)
isPrime = !ExtMillerRabin_U64(pWitness1, 7, P);
else
isPrime = !ExtMillerRabin_U64(pWitness2, 13, P);
if (PrintNT)
{
if (isPrime)
std::cout << P << "是素数。\n";
else
std::cout << P << "是合数。\n";
}
return isPrime;
}
UseJimSinclairWitness = true:使用Jim Sinclair发现的一组数作为证据数。
UseJimSinclairWitness = false:使用前13个素数作为证据数。
JimSinclair发现了一组数:2,325,9375,28178,450775,9780504,1795265022,用它们作为证据数,2^64以内100%判定。
http://miller-rabin.appspot.com/
外网链接,大家可能打不开,我把截图贴出来。
文献资料。
论文题目: On the Number of Witnesses in the Miller-Rabin Primality Test
论文作者: Shamil Talgatovich Ishmukhametov,Bulat Gazinurovich Mubarakov,Ramilya Gakilevna Rubtsova
日期出处: Symmetry 2020,12(6),890; https://doi.org/10.3390/sym12060890
在这篇论文中作者写道:
The last record is reached by J.Sorenson and J.Webster in 2016. They found ψ12 and ψ13, where ψ13 = 3317044064679887385961981.So at the moment we can successfully determine prime integers less than it by only 13 rounds of the MR test.
用前13个素数作为证据数,可以100%断言3317044064679887385961981以内的素数。使用13个证据数当然必使用7个证据数慢。
把所有代码写在一起,增强速度。
bool ExtPrimeQ_U64(unsigned long long P);
/*
★★★★★实用函数,增强速度3us以内。
单线程。
*/
bool ExtPrimeQ_U64(unsigned long long P)
{
if (P <= 4294967295)
return ExtPrimeQ_U32((unsigned int)P);
if ((P & 1ULL) == 0)
return false;
if (P % 3 == 0)
return false;
if (P % 5 == 0)
return false;
if (P % 7 == 0)
return false;
unsigned long long Witness = 2;
unsigned long long Remainder = 1;
unsigned long long HiPart;
unsigned long long LoPart;
unsigned long long P_1 = P - 1;
unsigned long k;
unsigned long long qbk = P_1;
unsigned long long q;
unsigned long t;
unsigned long Counter = 0;
bool isPrime = true;
_BitScanForward64(&k, qbk);
qbk >>= k;
NextRound:
q = qbk;
for (; q >= 2; q >>= 1)
{
if (q & 1ULL)
{
LoPart = _umul128(Remainder, Witness, &HiPart);
_udiv128(HiPart, LoPart, P, &Remainder);
}
LoPart = _umul128(Witness, Witness, &HiPart);
_udiv128(HiPart, LoPart, P, &Witness);
}
LoPart = _umul128(Remainder, Witness, &HiPart);
_udiv128(HiPart, LoPart, P, &Remainder);
if ((Remainder == 1) || (Remainder == P_1))
{
switch (Counter)
{
case 0:
Counter = 1;
Witness = 325;
Remainder = 1;
goto NextRound;
case 1:
Counter = 2;
Witness = 9375;
Remainder = 1;
goto NextRound;
case 2:
Counter = 3;
Witness = 28178;
Remainder = 1;
goto NextRound;
case 3:
Counter = 4;
Witness = 450775;
Remainder = 1;
goto NextRound;
case 4:
Counter = 5;
Witness = 9780504;
Remainder = 1;
goto NextRound;
case 5:
Counter = 6;
Witness = 1795265022;
Remainder = 1;
goto NextRound;
default:
goto ReturnHere;
}
}
for (t = 1; t < k; t++)
{
LoPart = _umul128(Remainder, Remainder, &HiPart);
_udiv128(HiPart, LoPart, P, &Remainder);
if (Remainder == P_1)
{
switch (Counter)
{
case 0:
Counter = 1;
Witness = 325;
Remainder = 1;
goto NextRound;
case 1:
Counter = 2;
Witness = 9375;
Remainder = 1;
goto NextRound;
case 2:
Counter = 3;
Witness = 28178;
Remainder = 1;
goto NextRound;
case 3:
Counter = 4;
Witness = 450775;
Remainder = 1;
goto NextRound;
case 4:
Counter = 5;
Witness = 9780504;
Remainder = 1;
goto NextRound;
case 5:
Counter = 6;
Witness = 1795265022;
Remainder = 1;
goto NextRound;
default:
goto ReturnHere;
}
}
}
isPrime = false;
ReturnHere:
return isPrime;
}
下面开一个猛料,使用双线程把速度再次提升。这个不好写文字讲解,直接看视频。
抖音:SPL19750815
我简单说一下思路:
1.多线程,是不能随用随开的。因为这个代码很快,开线程的时间开销远远大于检测素数的时间,所以线程是程序一上来就开启,直到程序结束而终止。如果只检测几个素数,那么单线程的更快,如果检测几千个以上,那么就是双线程更快。
2.怎么自动开启一个多线程呢?我得假设使用者一无所知,毫不知情地情况下就把这些初始化工作完成了。我给大家讲的是思路,不是讲死代码。
3.CAutoRun_NT这个类没有任何实质用途,就是生成一个全局对象,自动调用构造函数,在构造函数里自动开启线程,在析构函数里销毁线程。
4.主线程和子线程靠结构体ThreadParameters_NT传送参数。
/*
必须使用Visual Studio 2022以上版本,在工具栏上,选择x64编译选项。
*/
#define STARTGLOBALTASKTHREAD
/*
1.保留上面的语句,开启双线程。
2.注释上面的语句,只有单线程。
*************************************************************************
结论1:使用双线程,就是为了再快一点儿。但是本代码不依赖于双线程,单线程照
样比其他软件快。
结论2:我开的这个双线程,有可能干扰使用者的程序进行双线程设计了,此时应该
注释这行代码。如果使用者同时开启了两个线程,并且两个线程同时调用了
我的某些双线程函数,必定崩溃!
*************************************************************************
*/
class CAutoRun_NT
{
public:
static unsigned int Counter;
CAutoRun_NT();
virtual ~CAutoRun_NT();
};
/*
没有任何实质用途,生成一个全局对象,做一些初始化工作。
*/
struct ThreadParameters_NT
{
unsigned int Token;
bool Close;
bool Stop;
unsigned long long P;
bool isPrime;
};
/*
在主线程和子线程之间传输数据。
*/
void GlobalTaskThread_NT();
/*
◯该函数自动调用。
全局线程函数。
*/
unsigned int CAutoRun_NT::Counter = 0;
CAutoRun_NT GlobalAutoRun_NT;
ThreadParameters_NT GlobalThreadParameters_NT;
CAutoRun_NT::CAutoRun_NT()
{
if (Counter == 0)
{
#ifdef STARTGLOBALTASKTHREAD
GlobalThreadParameters_NT.Token = 0;
GlobalThreadParameters_NT.Close = false;
GlobalThreadParameters_NT.Stop = false;
GlobalThreadParameters_NT.P = 2;
GlobalThreadParameters_NT.isPrime = true;
thread Task(GlobalTaskThread_NT);
Task.detach();
#endif // STARTGLOBALTASKTHREAD
std::cout << "已完成初始化!\n\n";
}
Counter++;
}
CAutoRun_NT::~CAutoRun_NT()
{
Counter--;
if (Counter == 0)
{
#ifdef STARTGLOBALTASKTHREAD
GlobalThreadParameters_NT.Token = 0xFFFFFFFF;
while (GlobalThreadParameters_NT.Close == false)
{
__nop(); __nop(); __nop(); __nop();
__nop(); __nop(); __nop(); __nop();
}
#endif // STARTGLOBALTASKTHREAD
}
}
#include "Prime2.h"
extern ThreadParameters_NT GlobalThreadParameters_NT;
void GlobalTaskThread_NT()
{
unsigned long long Witness = 0;
unsigned long long Remainder = 0;
unsigned long long ProductHigh = 0;
unsigned long long ProductLow = 0;
unsigned long long P = 0;
unsigned long long P_1 = 0;
unsigned long k = 0;
unsigned long long q = 0;
unsigned long long qbk = 0;
unsigned long t = 0;
unsigned long Counter = 0;
for (;;)
{
if (GlobalThreadParameters_NT.Token == 1)
{
Witness = 450775;
Remainder = 1;
P = GlobalThreadParameters_NT.P;
qbk = P_1 = P - 1;
Counter = 0;
_BitScanForward64(&k, qbk);
qbk >>= k;
TheFutureTheBetter:
q = qbk;
for (; q >= 2; q >>= 1)
{
if (q & 1ULL)
{
ProductLow = _umul128(\
Remainder, Witness, &ProductHigh);
_udiv128(ProductHigh, ProductLow, P, &Remainder);
}
ProductLow = _umul128(Witness, Witness, &ProductHigh);
_udiv128(ProductHigh, ProductLow, P, &Witness);
}
ProductLow = _umul128(Remainder, Witness, &ProductHigh);
_udiv128(ProductHigh, ProductLow, P, &Remainder);
if ((Remainder == 1) || (Remainder == P_1))
{
/*
如果这个if语句成立,说明当前这个证据数失效,换下一个。 换
之前,扫描一下主线程是否下达过结束命令。
*/
if (GlobalThreadParameters_NT.Stop)
goto ReturnHere;
switch (Counter)
{
case 0:
Counter = 1;
Witness = 9780504;
Remainder = 1;
goto TheFutureTheBetter;
case 1:
Counter = 2;
Witness = 1795265022;
Remainder = 1;
goto TheFutureTheBetter;
default:
goto ReturnHere;
}
}
for (t = 1; t < k; t++)
{
ProductLow = _umul128(\
Remainder, Remainder, &ProductHigh);
_udiv128(ProductHigh, ProductLow, P, &Remainder);
if (Remainder == P_1)
{
if (GlobalThreadParameters_NT.Stop)
goto ReturnHere;
switch (Counter)
{
case 0:
Counter = 1;
Witness = 9780504;
Remainder = 1;
goto TheFutureTheBetter;
case 1:
Counter = 2;
Witness = 1795265022;
Remainder = 1;
goto TheFutureTheBetter;
default:
goto ReturnHere;
}
}
}
GlobalThreadParameters_NT.isPrime = false;
ReturnHere:
GlobalThreadParameters_NT.Token = 0;
}
__nop(); __nop(); __nop(); __nop();
__nop(); __nop(); __nop(); __nop();
if (GlobalThreadParameters_NT.Token == 0xFFFFFFFF)
{
GlobalThreadParameters_NT.Close = true;
break;
}
/*
main函数结束之前,全局对象GlobalAutoRun_NT的析构函数被调用,
置WhichOperation = 0xFFFFFFFF,销毁子线程。
*/
__nop(); __nop(); __nop(); __nop();
__nop(); __nop(); __nop(); __nop();
}
}
最后再贴一个代码运行时间测量函数。
void AbsoluteTiming(unsigned int LoopCount);
/*
★★★★★实用函数。
测量代码执行时间。
被测量的函数代码或者函数调用需要使用者自己写进去,LoopCount是循环次数,最
后算平均值。打开这个函数的定义, 里面注释写得清清楚楚需要在哪里添加代码。
名义上的时间分辨率是1us,实际上受到各种干扰没有那么精准。为了获得准确的时
间测量结果(排除其它干扰),累计总用时至少控制在1ms以上。
*/
void AbsoluteTiming(unsigned int LoopCount)
{
unsigned int k;
if (LoopCount == 0)
LoopCount = 1;
LARGE_INTEGER StartingTime;
LARGE_INTEGER EndingTime;
LARGE_INTEGER ElapsedMicroseconds{};
LARGE_INTEGER Frequency;
QueryPerformanceFrequency(&Frequency);
QueryPerformanceCounter(&StartingTime);
//***********************************************************************
//在此处加入初始化代码,例如声明变量和赋初始值等。
//以下注释掉的代码仅是示例。
for (k = 0; k < LoopCount; k++)
{
//在此处加入需要测量时间的代码或者函数调用。
//以下注释掉的代码仅是示例。
/*
ExtPrimeQ_U64(18446744073709551557);
*/
}
//***********************************************************************
QueryPerformanceCounter(&EndingTime);
ElapsedMicroseconds.QuadPart = \
EndingTime.QuadPart - StartingTime.QuadPart;
ElapsedMicroseconds.QuadPart *= 1000000;
ElapsedMicroseconds.QuadPart /= Frequency.QuadPart;
printf_s("循环%I32u次,总计用时%I64d微秒。\n", LoopCount, \
ElapsedMicroseconds.QuadPart);
double AverageTime = (double)ElapsedMicroseconds.QuadPart / LoopCount;
printf_s("平均每次用时%.1lf微秒。\n", AverageTime);
}
还有一个使用作弊方法的更快的32bit素数检测代码。用2,7,61这三个证据数100%检测33bit内的所有素数,还能不能更快呢?原则上是不能了,因为1个或者2个证据数必定有误判的,需要异想天开,鬼斧神工的思路才行。我就用2这个证据数来检测,事先把误判的找出来,然后写在一个表中(文件中),程序初始化的时候载入表,当检测的时候,先看看这个数是否在表中,如果在表中,则是合数,如果不在,那我就用一轮2的米勒拉宾检测搞定。那么64bit能不能用这种方法?抱歉不行了,因为2^64太大,你不可能实现找出来那些误判的数,需要1000年以上的时间。
bool ThunderPrimeQ_U32(unsigned int P);
/*
★★★★★实用函数。
2^32以内100%判定P是否为素数,雷电极速0.2us以内。 马后炮、作弊流、全武行。
这个函数依赖于作弊表!
*/
bool ThunderPrimeQ_U32(unsigned int P)
{
if (P <= 100)
{
switch (P)
{
case 2:
case 3:
case 5:
case 7:
case 11:
case 13:
case 17:
case 19:
case 23:
case 29:
case 31:
case 37:
case 41:
case 43:
case 47:
case 53:
case 59:
case 61:
case 67:
case 71:
case 73:
case 79:
case 83:
case 89:
case 97:
return true;
default:
return false;
}
}
if ((P & 1) == 0)
return false;
if (P % 3 == 0)
return false;
if (P % 5 == 0)
return false;
if (P % 7 == 0)
return false;
if ((P == 2047) || (P == 4294901761))
return false;
unsigned int u = 0;
unsigned int v = 2255;
unsigned int w;
for (;;)
{
if ((u + 1) == v)
break;
else
{
w = (u + v) >> 1;
if (P == pQuickTable_NT[w])
return false;
else if (P < pQuickTable_NT[w])
v = w;
else
u = w;
}
}
/*
速查表存储着用证据数2的米勒拉宾算法无法断言的那些数,这些数都是合数。
第一个是2047,最后一个是4294901761,总计2256个,二分查找。 如果命中的
话,P就是合数;如果没有命中,再用一轮2的米勒拉宾算法判定。
*/
unsigned int Witness = 2147483648 % P;
unsigned int Remainder = 1;
unsigned long long Product;
unsigned int P_1 = P - 1;
unsigned long k;
unsigned int q = P_1;
unsigned long Max;
unsigned int t;
_BitScanForward(&k, q);
q >>= k;
u = _udiv64(q, 31, &v);
if (u)
{
if (u <= 12)
{
for (t = 0; t < u; t++)
{
Product = (unsigned long long)Remainder * Witness;
_udiv64(Product, P, &Remainder);
}
}
else
{
_BitScanReverse(&Max, u);
Max = (1 << (Max - 1));
for (t = 1; t <= Max; t <<= 1)
{
if (u & t)
{
Product = (unsigned long long)Remainder * Witness;
_udiv64(Product, P, &Remainder);
}
Product = (unsigned long long)Witness * Witness;
_udiv64(Product, P, &Witness);
}
Product = (unsigned long long)Remainder * Witness;
_udiv64(Product, P, &Remainder);
}
}
Product = ((unsigned long long)Remainder << v);
_udiv64(Product, P, &Remainder);
if ((Remainder == 1) || (Remainder == P_1))
return true;
for (t = 1; t < k; t++)
{
Product = (unsigned long long)Remainder * Remainder;
_udiv64(Product, P, &Remainder);
if (Remainder == P_1)
return true;
}
return false;
}
如果没有看懂,看视频讲解。
跟老师我学思路,不是学死代码!
视频没有审核通过之前可以看抖音:SPL19750815
代码链接: