1. Design of HPC system
ASC Student Supercomputer Challenge 2020
1.1 Hardware resources
Item | Name | Configuration | Quantity |
---|---|---|---|
Server | Inspur NF5280M5 | CPU: Intel Xeon Gold 6230 x 2, 2.1GHz, 20 cores; Memory: 32G x 12, DDR4, 2933MHz; Hard disk: 480G SSD SATA x 1; Power consumption estimation: 6230 TDP 125W, memory 7.5W, hard disk 10W | 4 |
HCA card | EDR | InfiniBand Mellanox ConnectX®-5 HCA card, single port QSFP, EDR IB; Power consumption estimation: 9W | 4 |
Switch | GbE switch | 10/100/1000Mb/s, 24-port Ethernet switch; Power consumption estimation: 30W | 1 |
Switch | EDR-IB switch | Switch-IB™ EDR InfiniBand switch, 36 QSFP ports; Power consumption estimation: 130W | 1 |
Cable | Gigabit CAT6 cables | CAT6 copper cable, blue, 3m | 1 |
Cable | InfiniBand cable | InfiniBand EDR copper cable, QSFP port, for use with the InfiniBand switch | 1 |
GPU | NVIDIA Tesla V100-PCI-E | | 8 |
Hard disk | Samsung 970 EVO NVMe M.2 (250GB) | | 4 |
Memory | Kingston DDR4 2933MHz Server Premier (KSM29RD4/32ME) | | 48 |
1.2 Software resources
Item | Name |
---|---|
Operating system | Debian-9.4 |
Job scheduling system | Slurm-19.05.6 |
File system | zfs-0.6.5.9-1 |
Hardware monitoring software | Zabbix4.0+Grafana6.0 |
Package manager | Spack-3.7 |
Cluster management software | Clustershell-1.7.3 |
Parallel environment | MPICH-3.3.2 |
Parallel environment | Intel MPI-5.1.3 |
Parallel environment | Open MPI-3.1.2 |
Application development environment | Python-3.6.6 |
Application development environment | GNU compiler-4.4.7 |
Application development environment | PGI compiler-17.4 |
Application development environment | Intel compiler-12.10 |
Application development environment | BLAS-3.8, LAPACK-3.1.1, FFTW-3.2.2, Intel MKL-2018.0.3 |
Application development environment | Anaconda2-4.3.0, cuda9.1, pytorch-gpu-1.0.1, hpl-2.3, hpcg-3.0 |
1.3 Cluster analysis
1.3.1 Architecture diagram
[Figure: cluster architecture diagram; image not available]
1.3.2 Cluster power
Item | Power |
---|---|
CPU | 100W x 8 |
GPU | 250W x 8 |
HCA card | 9W x 4 |
Switch | 30W + 130W |
Hard disk + Memory | 7.5W + 10W |
Sum | 2933.5W |
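
The power budget can be re-added with a short script. This is a minimal sketch that copies every figure from the table above; the dictionary name `power_items_w` is only illustrative. Summing the rows exactly as listed gives 3013.5 W, which is 80 W more than the 2933.5 W total in the table, and the table does not say which entry accounts for the difference.

```python
# Power budget check: every figure below is copied from the table in 1.3.2.
# The dictionary name and layout are illustrative, not part of the proposal.
power_items_w = {
    "CPU": 100 * 8,                 # 100 W x 8 CPUs
    "GPU": 250 * 8,                 # 250 W x 8 GPUs
    "HCA card": 9 * 4,              # 9 W x 4 cards
    "Switch": 30 + 130,             # GbE switch + EDR-IB switch
    "Hard disk + Memory": 7.5 + 10, # listed once in the table
}

total_w = sum(power_items_w.values())

for name, watts in power_items_w.items():
    print(f"{name:<20} {watts:>8.1f} W")
print(f"{'Total':<20} {total_w:>8.1f} W")
```

Keeping the budget in a script makes it easy to re-check the total whenever a per-item estimate changes.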
1.3.3 Floating-point performance
CPU theoretical floating-point peak:
= 1.512 GHz x 2 x 2 x (512/64) x 8 x 20 = 7741.44 GFLOPS = 7.74 TFLOPS
GPU theoretical floating-point peak:
7.0 TFLOPS x 8 = 56.0 TFLOPS
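
The peak figures above can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the numeric factors come from the CPU formula, but the meaning attached to each one (two FMA units per core, two floating-point operations per FMA, 512-bit vectors holding 64-bit doubles, 8 CPUs of 20 cores each) is an interpretation, and the function and parameter names are hypothetical.

```python
def cpu_peak_gflops(freq_ghz, fma_units, flops_per_fma,
                    vector_bits, word_bits, cpus, cores_per_cpu):
    """Theoretical double-precision peak in GFLOPS.

    The factor labels are an assumed reading of the formula in the text.
    """
    lanes = vector_bits // word_bits  # 512 / 64 = 8 doubles per vector
    return (freq_ghz * fma_units * flops_per_fma
            * lanes * cpus * cores_per_cpu)


# Values taken from the CPU formula: 1.512 GHz x 2 x 2 x (512/64) x 8 x 20
cpu_gflops = cpu_peak_gflops(
    freq_ghz=1.512, fma_units=2, flops_per_fma=2,
    vector_bits=512, word_bits=64, cpus=8, cores_per_cpu=20,
)

# GPU peak as stated in the text: 7.0 TFLOPS per card, 8 cards
gpu_tflops = 7.0 * 8

print(f"CPU peak: {cpu_gflops:.2f} GFLOPS ({cpu_gflops / 1000:.2f} TFLOPS)")
print(f"GPU peak: {gpu_tflops:.1f} TFLOPS")
print(f"Combined: {cpu_gflops / 1000 + gpu_tflops:.2f} TFLOPS")
```

Writing the factors as named parameters makes it straightforward to redo the estimate for a different clock frequency or node count.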
1.3.4 Pros and cons
1.3.4.1 Advantages:
1. High scalability. Nodes can easily be added and the system expanded or upgraded, and the cluster software reduces the hardware requirements.
2. Simple management and installation. The simple architecture maximizes performance and can be installed quickly for practical applications.
3. Rich application software. Middleware handles coordination and communication between nodes, so that all nodes in the system can truly cooperate and balance load.
4. Advanced. A high-speed InfiniBand interconnect forms the computing environment, and the parallel computing support software and the job scheduling system make the nodes work together.
5. Strong computing power. Parallel computing and powerful GPU processing meet stringent requirements for running speed.
6. Visual hardware management. The visual hardware management system keeps the cluster power within the required range more accurately and efficiently.
1.3.4.2 Disadvantages:
1. Limited stability. The system is positioned as a small HPC cluster in which the nodes are not set up separately, which leaves room for improvement in terms of stability.
2. Single management node. Only one management node manages metadata. When the cluster reaches a certain scale, the management node becomes overly busy and turns into the system bottleneck.
3. Management overhead. Cluster management software may reduce the computing speed of the cluster to a certain extent.