Copyright 2009- Jun Makino

    mkdir /mnt/huge
    echo 5120 > /proc/sys/vm/nr_hugepages
    mount -t hugetlbfs none /mnt/huge -o mode=0777

The name of the mmapped file is /mnt/huge/aaa.

    Enter n, seed, nb:N=34816 Seed=1 NB=2048
    a, awork offset size= 7fae21c00000 7fb064400000 242800000 242800000 242800000
    size of size_t and off_t long= 8 8 8
    read/set mat end
    copy mat end
    Emax= 5.044e-07
    Nswap=0 cpsec = 2910.57 wsec=730.139
    swaprows time=3.72784e+09 ops/cycle=0.162581
    scalerow time=1.79725e+09 ops/cycle=0.337224
    trans rtoc8 time=4.34249e+09 ops/cycle=0.139569
    trans ctor8 time=2.40224e+09 ops/cycle=0.252296
    trans mmul time=5.83755e+09 ops/cycle=1.55736
    trans nonrec cdec time=2.9062e+09 ops/cycle=0.208546
    trans vvmul time=1.02325e+09 ops/cycle=0.592304
    trans findp time=1.86805e+09 ops/cycle=0.324444
    solve tri u time=1.16613e+10 ops/cycle=2.9856e-06
    solve tri time=1.6063e+11 ops/cycle=2.16747e-07
    matmul nk8 time=0 ops/cycle=inf
    matmul snk time=6.36856e+09 ops/cycle=1.52267
    trans mmul8 time=0 ops/cycle=inf
    trans mmul4 time=1.82159e+09 ops/cycle=1.33087
    DGEMM time=1.91704e+12 ops/cycle=14.6762

The nominal speed is 38.5 Gflops. This code uses DGEMM for the DTRSM part, which doubles the DTRSM cost. Thus, the raw speed is 41.9 Gflops, or 98% of the theoretical peak of 42.76 Gflops (/proc/cpuinfo says the clock speed is 2672.727 MHz).

For smaller NB, the calculation is faster, but the raw performance is slightly lower (40.6 Gflops):

    Enter n, seed, nb:N=34816 Seed=1 NB=256
    a, awork offset size= 7fad35600000 7faf77e00000 242800000 242800000 242800000
    size of size_t and off_t long= 8 8 8
    read/set mat end
    copy mat end
    Emax= 9.244e-08
    Nswap=0 cpsec = 2797.93 wsec=700.362
    swaprows time=3.57559e+09 ops/cycle=0.169504
    scalerow time=1.71234e+09 ops/cycle=0.353947
    trans rtoc8 time=4.35866e+09 ops/cycle=0.139051
    trans ctor8 time=2.40053e+09 ops/cycle=0.252476
    trans mmul time=5.83701e+09 ops/cycle=1.5575
    trans nonrec cdec time=2.90349e+09 ops/cycle=0.208741
    trans vvmul time=1.02339e+09 ops/cycle=0.592227
    trans findp time=1.8687e+09 ops/cycle=0.324331
    solve tri u time=5.74246e+08 ops/cycle=6.06291e-05
    solve tri time=2.66881e+10 ops/cycle=1.30455e-06
    matmul nk8 time=0 ops/cycle=inf
    matmul snk time=6.3678e+09 ops/cycle=1.52286
    trans mmul8 time=0 ops/cycle=inf
    trans mmul4 time=1.8209e+09 ops/cycle=1.33138
    DGEMM time=1.83881e+12 ops/cycle=15.3006

    Enter n, seed, nb:N=34816 Seed=1 NB=256
    read/set mat end
    copy mat end
    Emax= 1.136e-07
    Nswap=0 cpsec = 2782.51 wsec=696.245 40.4095 Gflops
    swaprows time=3.43022e+09 ops/cycle=0.176688
    scalerow time=1.56175e+07 ops/cycle=38.8077
    trans rtoc time=3.14043e+09 ops/cycle=0.192992
    trans ctor time=1.89603e+09 ops/cycle=0.319656
    trans mmul time=5.14364e+09 ops/cycle=1.76745
    trans nonrec cdec time=1.40592e+09 ops/cycle=0.431089
    trans vvmul time=4.97509e+08 ops/cycle=1.21822
    trans findp time=9.07189e+08 ops/cycle=0.668083
    solve tri u time=3.85358e+08 ops/cycle=9.03471e-05
    solve tri time=2.66933e+10 ops/cycle=11.6251
    matmul nk8 time=0 ops/cycle=inf
    matmul snk time=59096 ops/cycle=164093
    trans mmul8 time=0 ops/cycle=inf
    trans mmul4 time=1.39377e+09 ops/cycle=1.73939
    trans mmul2 time=8.11773e+08 ops/cycle=1.49322
    DGEMM time=1.84005e+12 ops/cycle=15.459
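
The speed figures quoted in this file can be cross-checked from N, NB and wsec. The following is a rough sketch, not taken from the benchmark code itself. It assumes the usual leading-order LU operation count of 2N^3/3, a peak of 16 flops per clock cycle (for example 4 cores times 4 double-precision flops per cycle, which is not stated above), and an extra NB*N^2 operations for doing the DTRSM part with DGEMM. That last term is only a guess: the formula is not given in this file, and it was chosen because it reproduces the quoted raw speeds. The function names are hypothetical.

```python
# Sketch only: cross-checks the Gflops figures quoted in this file.
# Everything marked ASSUMPTION is not stated in the source.

CLOCK_HZ = 2672.727e6   # clock speed from /proc/cpuinfo, as quoted above
FLOPS_PER_CYCLE = 16    # ASSUMPTION: e.g. 4 cores x 4 double-precision flops/cycle


def lu_flops(n):
    """Leading-order operation count of LU decomposition: 2*n^3/3."""
    return 2.0 * n**3 / 3.0


def nominal_gflops(n, wsec):
    """Nominal speed: LU operation count divided by wall-clock seconds."""
    return lu_flops(n) / wsec / 1e9


def raw_gflops(n, nb, wsec):
    """Speed that also counts the extra work of doing DTRSM with DGEMM.

    ASSUMPTION: the extra work is nb * n^2 operations. This term is a
    guess that matches the quoted numbers, not a formula from the source.
    """
    return (lu_flops(n) + nb * float(n) ** 2) / wsec / 1e9


def peak_gflops():
    """Theoretical peak under the FLOPS_PER_CYCLE assumption."""
    return CLOCK_HZ * FLOPS_PER_CYCLE / 1e9


if __name__ == "__main__":
    n = 34816
    # (NB, wsec) pairs copied from the three runs listed above
    runs = [(2048, 730.139), (256, 700.362), (256, 696.245)]
    print(f"peak = {peak_gflops():.2f} Gflops")
    for nb, wsec in runs:
        nominal = nominal_gflops(n, wsec)
        raw = raw_gflops(n, nb, wsec)
        percent = 100.0 * raw / peak_gflops()
        print(f"NB={nb:5d} wsec={wsec:8.3f} "
              f"nominal={nominal:6.2f} raw={raw:6.2f} Gflops "
              f"({percent:.1f}% of peak)")
```

Under these assumptions the NB=2048 run comes out at about 38.5 Gflops nominal and 41.9 Gflops raw, against a peak of about 42.76 Gflops, and the last run at about 40.41 Gflops nominal, which agrees with the value printed by the program.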