Some problems encountered during compiling and testing.
1. Actual Performance, Theoretical Peak Performance
Formula: Peak.perf = (floating-point operations per cycle per core) * (#cores) * Freq
http://www.cnblogs.com/kerrycode/archive/2012/07/06/2578658.html
http://software.intel.com/en-us/articles/performance-tools-for-software-developers-hpl-application-note/
The hybrid (MPI + OpenMP) parallel versions of the HPL binaries are also included in the package.
cat /proc/cpuinfo | grep GHz | uniq
cat /proc/meminfo
model name : Intel(R) Xeon(R) CPU E5640 @ 2.67GHz (the "cpu MHz" value / 1000 is more accurate)
N^2 * 8 bytes (size of a double) < memory size * 80%; N could reach more than 37000
1 node: 12188376 kB = 11.62 GB
2532.970 MHz (2.53 GHz), 9.8945 GFLOPS per core
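A minimal sketch of the memory rule above as a shell function, assuming a POSIX shell with awk. The name hpl_max_n and the rounding of N down to a multiple of NB are illustrative choices, not part of HPL.

```shell
# hpl_max_n MEM_KB [FRACTION] [NB]
# Largest N with N^2 * 8 bytes <= MEM_KB * 1024 * FRACTION,
# rounded down to a multiple of NB (hypothetical helper, not part of HPL).
hpl_max_n() {
    awk -v kb="$1" -v frac="${2:-0.8}" -v nb="${3:-1}" 'BEGIN {
        n = int(sqrt(kb * 1024 * frac / 8))   # 8 bytes per double
        n = n - (n % nb)                      # align N to the block size
        print n
    }'
}
```

For the node above, hpl_max_n 12188376 0.8 768 prints 35328. That is below the 37000 noted above, which corresponds to roughly 88% of memory. To read the memory size directly: hpl_max_n "$(awk '/MemTotal/ {print $2}' /proc/meminfo)" 0.8 768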
Check NVIDIA information
nvidia-smi
Tesla C2050
Total: 2687 MB
http://www.siliconmechanics.com/files/C2050Benchmarks.pdf
1030 GFLOPS (single precision), 515 GFLOPS (double precision)
Test example:
Ratio = GPU peak GFLOPS / (CPU peak + GPU peak GFLOPS); values used: 0.8 / 0.7
N      NB    P  Q  Time(s)  Gflops
50000  768   1  2  118.69   7.021e+02  59%
36864  1024  1  1  95.26    3.506e+02  (36k)
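A rough sketch for checking a measured result against the theoretical peak, using the formula from section 1. The function names are hypothetical; the flops-per-cycle and core-count arguments must be supplied for the actual CPU, and the optional fourth argument adds an accelerator peak (for example the 515 GFLOPS double-precision figure above).

```shell
# peak_gflops FLOPS_PER_CYCLE CORES FREQ_GHZ [ACCEL_GFLOPS]
# Theoretical peak in GFLOPS; the accelerator term is optional
# (hypothetical helper, values are supplied by the caller).
peak_gflops() {
    awk -v f="$1" -v c="$2" -v g="$3" -v a="${4:-0}" \
        'BEGIN { printf "%.2f\n", f * c * g + a }'
}

# hpl_efficiency MEASURED_GFLOPS PEAK_GFLOPS
# Measured performance as a percentage of peak.
hpl_efficiency() {
    awk -v m="$1" -v p="$2" 'BEGIN { printf "%.1f\n", 100 * m / p }'
}
```

Example call, where 4 flops per cycle and 8 cores per node are assumed values not taken from these notes: hpl_efficiency 350.6 "$(peak_gflops 4 8 2.53 515)"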
2. Launcher preference: use mpirun > mpiexec > mpirun_rsh
3. Intel MPI usage
ENV
source /home/limin/intel/impi/4.0.3.008/bin64/mpivars.sh
export LD_LIBRARY_PATH=/opt/intel/composer_xe_2013.0.079/mkl/lib/intel64:$LD_LIBRARY_PATH
RUN COMMAND
/home/limin/intel/impi/4.0.3.008/intel64/bin/mpirun -n 8 -f hosts -perhost 4 -genv I_MPI_PIN_DOMAIN node ./xhpl_hybrid_intel64
export OMP_NUM_THREADS=4
/home/limin/intel/impi/4.0.3.008/intel64/bin/mpirun -n 8 -f hosts -perhost 2 -genv I_MPI_PIN_DOMAIN=omp:scatter ./xhpl_hybrid_intel64
Other commands
~/mv/mv2/bin/mpdboot -n
~/mv/mv2/bin/mpiexec -machinefile hosts -np 1 ./xhpl_hybrid_intel64
Use Open MPI's hwloc to view hardware information
HPL Parameter Explanation:
WR  1      0      L      4     L/C/R  2
    depth  bcast  rfact  NDIV  pfact  NBMIN
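A small sketch that splits such a code into the fields listed above, assuming it is written without spaces (for example WR10L4L2) and that every field is a single character. The name decode_tv is hypothetical.

```shell
# decode_tv CODE
# Split an HPL parameter code such as WR10L4L2 into named fields.
# Positions 1-2 are the "WR" prefix; each later field is one character
# (hypothetical helper; field order follows the explanation above).
decode_tv() {
    awk -v s="$1" 'BEGIN {
        printf "depth=%s bcast=%s rfact=%s NDIV=%s pfact=%s NBMIN=%s\n",
            substr(s, 3, 1), substr(s, 4, 1), substr(s, 5, 1),
            substr(s, 6, 1), substr(s, 7, 1), substr(s, 8, 1)
    }'
}
```

decode_tv WR10L4L2 prints: depth=1 bcast=0 rfact=L NDIV=4 pfact=L NBMIN=2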
TACC's suggestions
BCAST=5 // 4 or 5 may be competitive for machines with very fast nodes relative to the network
MV2_IBA_HCA=mlx4_0
MV2_VBUF_TOTAL_SIZE=140000 // same value as MV2_IBA_EAGER_THRESHOLD
MV2_IBA_EAGER_THRESHOLD=140000
MV2_SHOW_ENV_INFO=1
MV2_ENABLE_AFFINITY=0 // disable CPU binding
MV2_USE_SHARED_MEM=0 // intra-node traffic goes through the network adapter
MV2_SMP_USE_LIMIC2=0 // p. 48: disabled, so no need to change