先写结论
System.Random是线程不安全的,要避免使用单例,或者进行加锁操作。参考:
https://docs.microsoft.com/en-us/dotnet/api/system.random
这里有一句话:If you don’t ensure that the Random object is accessed in a thread-safe way, calls to methods that return random numbers return 0.
简单说就是如果在多线程情况下,可能导致返回的随机数永远为0。
问题排查过程
今天上午,突然有个站点挂了,请求都出现超时,上去一看,服务器CPU满了,被站点吃了,按惯例,先用任务管理器,创建转储文件,完成后重启站点,此时站点也恢复正常了。
把dmp转储文件拖回来,用windbg打开,用命令 !runaway 看到有9个线程,执行超过2分钟:
0:035> !runaway
User Mode Time
Thread Time
30:6504 0 days 0:42:29.390
35:2e10 0 days 0:37:07.000
34:3d0c 0 days 0:35:55.437
38:1274 0 days 0:33:56.390
39:1c44 0 days 0:16:29.000
40:5fd8 0 days 0:16:01.781
36:2124 0 days 0:10:47.609
51:30f8 0 days 0:06:15.671
54:2954 0 days 0:02:05.140
9:3450 0 days 0:00:18.656
用命令 ~30s 切换到30这个线程,再用 !clrstack 看托管代码,发现如下代码:
0:030> !clrstack
OS Thread Id: 0x6504 (30)
Child SP IP Call Site
000000487470b440 00007ffb528e1f50 System.Random.InternalSample()
000000487470b470 00007ffb531d8138 System.Random.NextBytes(Byte[])
000000487470b4c0 00007ffaffafe87a Pinpoint.Core.Util.SpanIdUtil.GetNewSpanId()
000000487470b4f0 00007ffb014a53fb Pinpoint.Core.Util.SpanIdUtil.GetNextSpanId(Int64, Int64)
000000487470b530 00007ffb014a5370 Pinpoint.Core.TraceId.GetNextTraceId()
再看了一下另外8个线程,都是在是Random.InternalSample 和 Random.NextBytes 方法,
初步判断是不是Random导致的问题,也没怎么看代码,
就去Google上搜索: system.random.nextbytes high cpu
结果很正常,搜索不到任何资料,再去翻了一下同事写的代码,
就是简单的Random单例封装,我以前好像也是这么写的:
public class SpanIdUtil
{
private static readonly Random rnd = new Random(Environment.TickCount);
public static long GetNextSpanId(long spanId, long parentSpanId)
{
long newId = GetNewSpanId();
while (newId == spanId || newId == parentSpanId)
newId = GetNewSpanId();
return newId;
}
public static long GetNewSpanId()
{
byte[] buf = new byte[8];
rnd.NextBytes(buf);
return BitConverter.ToInt64(buf, 0);
}
}
又用WinDbg看堆栈、看线程池,都没能发现什么问题……
后来去stackoverflow上咨询了一下,那边的兄弟说应该是线程不安全,导致死循环了,
再看了一下msdn,看到了上面那句话,这回有了线索,再翻Windbg里的堆栈,用 !clrstack -a
0:034> !clrstack -a
OS Thread Id: 0x3d0c (34)
Child SP IP Call Site
000000487643b948 00007ffb698b0bba [HijackFrame: 000000487643b948]
000000487643ba00 00007ffb531d8138 System.Random.NextBytes(Byte[])
PARAMETERS:
this (<CLR reg>) = 0x0000004767e170b8
buffer (<CLR reg>) = 0x00000043e7eabae8
LOCALS:
<CLR reg> = 0x000000005dc11500
000000487643ba50 00007ffaffafe87a Pinpoint.Core.Util.SpanIdUtil.GetNewSpanId()
LOCALS:
<no data>
000000487643ba80 00007ffb014a53fb Pinpoint.Core.Util.SpanIdUtil.GetNextSpanId(Int64, Int64)
PARAMETERS:
spanId (<CLR reg>) = 0x0000000000000000
parentSpanId (<CLR reg>) = 0xffffffffffffffff
LOCALS:
<no data>
000000487643bac0 00007ffb014a5370 Pinpoint.Core.TraceId.GetNextTraceId()
PARAMETERS:
this (<CLR reg>) = 0x00000044e7ea87e0
发现 这一句: spanId () = 0x0000000000000000
这个spanId也是之前的业务得到的随机数,不太可能为0,而所有线程里,这个参数都是0,
看来就是这个问题,导致那段代码里的while (newId == spanId 始终为true,从而进入死循环,导致CPU消耗。
问题找到,就好解决了,把代码简单改造2个地方:
1、当然是 lock(rnd)
2、对循环多加一层次数判断,避免还有其它可能导致死循环
public class SpanIdUtil
{
private static readonly Random rnd = new Random(Environment.TickCount);
public static long GetNextSpanId(long spanId, long parentSpanId)
{
long newId = GetNewSpanId();
int total = 0;
while (total++ < 10 && (newId == spanId || newId == parentSpanId))
newId = GetNewSpanId();
if(total >= 10)
throw new Exception("Can not gen next span id:" + newId.ToString());
return newId;
}
public static long GetNewSpanId()
{
byte[] buf = new byte[8];
lock (rnd)
rnd.NextBytes(buf);
return BitConverter.ToInt64(buf, 0);
}
}
自己写了一段代码测试,重现这个问题:
class Program
{
static Random rnd = new Random(Guid.NewGuid().GetHashCode());
static void Main(string[] args)
{
var loop = 2;
ThreadPool.SetMinThreads(loop, 10);
for (var i = 0; i < loop; i++)
{
ThreadPool.UnsafeQueueUserWorkItem(RndTest, null);
}
while (true)
{
Thread.Sleep(1000);
Console.WriteLine(rnd.Next() + " ===");// 只要loop大于2,这些永远输出0
}
Console.ReadLine();
}
static void RndTest(object state)
{
while (true)
//lock(rnd)
rnd.Next();
}
}
小结一下,之前没碰到过 Random 还有这种坑,
另外就是对一些循环的结束判断语句,如果不能确保它正常结束,一定要增加次数判断来避免可能的问题,增加代码的健壮性。