
Performance of serial code

The benchmark numbers given here have been measured with a benchmark designed to mimic the behavior of VASP. Three separate programs make up the benchmark: the first measures matrix-matrix performance (Lincom-TPP), the second matrix-vector performance (matrix-vec), and the third the performance of 3D-FFTs (fft). The mixture of the three parts is supposed to be similar to what one would encounter when simulating a large system (40-100 transition metal atoms). For the matrix$\times$matrix part DGEMM is used; for the matrix$\times$vector part, DGEMV, do-loop, or DGEMM results are reported (depending on where the machine scores highest). The fft benchmark either uses an optimized routine supplied by the manufacturer, or a routine written and optimized by J. Furthmüller.
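
To give an idea of how the Mflops figures in the tables are obtained, the following minimal Fortran sketch times one DGEMM call and a repeated DGEMV call and converts the operation count into Mflops. This is not the actual dgemmtest source: the matrix dimension, the repeat count, and the use of CPU_TIME are illustrative assumptions, and an optimized BLAS (for instance the Atlas or Goto libraries mentioned in the footnotes below) must be linked.

     program flops_sketch
       ! Hedged sketch, NOT the actual dgemmtest source: estimate the
       ! matrix-matrix (DGEMM) and matrix-vector (DGEMV) Mflops as
       ! operation count divided by elapsed time.
       implicit none
       integer, parameter :: n = 1000     ! matrix dimension: an assumption
       integer, parameter :: nrep = 100   ! repetitions for the short DGEMV
       real(8), allocatable :: a(:,:), b(:,:), c(:,:), x(:), y(:)
       real(8) :: t1, t2
       integer :: i

       allocate(a(n,n), b(n,n), c(n,n), x(n), y(n))
       call random_number(a)
       call random_number(b)
       call random_number(x)

       ! matrix-matrix part: C = A*B costs about 2*n**3 operations
       call cpu_time(t1)
       call dgemm('N', 'N', n, n, n, 1d0, a, n, b, n, 0d0, c, n)
       call cpu_time(t2)
       write(*,'(A,F10.1)') ' DGEMM Mflops: ', 2d0*real(n,8)**3/(t2-t1)/1d6

       ! matrix-vector part: y = A*x costs about 2*n**2 operations per call;
       ! repeated nrep times so the timer resolution does not dominate
       call cpu_time(t1)
       do i = 1, nrep
         call dgemv('N', n, n, 1d0, a, n, x, 1, 0d0, y, 1)
       end do
       call cpu_time(t2)
       write(*,'(A,F10.1)') ' DGEMV Mflops: ', nrep*2d0*real(n,8)**2/(t2-t1)/1d6
     end program flops_sketch

Since Mflops is by definition the operation count divided by the elapsed time, poor timer resolution is the main pitfall for the fast matrix-vector operation; this is why the DGEMV call is repeated in a loop above.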

The table also shows the timings for the bench.Hg and bench.PdO benchmarks, which are located on the VASP server in the src directory (bench.Hg.tar.gz and bench-PdO.tar.gz). The numbers shown are those written in the ``LOOP+'' line of the OUTCAR file (type: grep 'LOOP+' OUTCAR).

You can test your own machine by compiling ffttest and dgemmtest in the VASP.4.X (X$>$3) directory and typing


 dgemmtest <lincom.table 
 dgemmtest <rpro.table
 ffttest
This will execute the tests ``Lincom-TPP'', ``matrix-vec'' and ``fft'', in this order (serial version only). Note that with the present algorithms the matrix-vector part is less important than the synthetic mix of ``Lincom-TPP'', ``matrix-vec'' and ``fft'' suggests. In addition, for the bench.Hg benchmark the matrix-matrix part plays a more significant role than in the synthetic benchmark.

Currently, all high performance machines run VASP fairly well. The cheapest option (best value at the lowest price) is presently an AMD Athlon-64 or Intel P4 based PC; for compilation we recommend the ifc compiler. Which processor (clock speed) to buy depends a little on the budget and the available space. If you need a high packing density, dual Opteron machines are a good option. IBM Power 4 based machines and Intel Itanium machines (SGI Altix, HP-UX) remain competitive, but at a somewhat steeper price than PCs.

                    IBM RS6000  IBM RS6000  IBM RS6000  IBM RS6000  IBM RS6000  IBM SP3
                    590         3CT         595$^{++}$  595$^{++}$  397         High Node
lincom-TPP(Mflops)  245         237         389         389         580         1220
matrix-vec(Mflops)  110         73/128      110         110         300         300/400
Lincom-TPP          40.6 s      42.7 s      25.0 s      21.4 s      17.8 s      8.4 s
matrix-vec          32.3 s      40.4 s      32.3 s      19.4 s      15.3 s      12.1 s
fft                 31.4 s      35.0 s      24.0 s      17.3 s      14.4 s      5.1 s
TOTAL               103 s       117 s       81.3 s      58.3 s      47.5 s      26.8 s
RATING              1           0.9         1.3         1.8         2.2         3.8
bench.Hg            1663        1920        1380        1000        809         356
                    IBM RS6000  IBM SP4        ITANIUM 2  ITANIUM 2  Altix 350   Altix 3700 Bx2
                    590         1300           1300       1600       1600
                                               HP-UX      LINUX      SUSE SLES9  SUSE SLES 9
lincom-TPP(Mflops)  245         3100           5000       4300       5932        6129
matrix-vec(Mflops)  110         600/800        1200/2300  1200/1500  1378/2021   2671/3135
Lincom-TPP          40.6 s      3.2 s          2.0 s      2.3 s      1.7 s       1.7 s
matrix-vec          32.3 s      6.0 s          2.3 s      2.6 s      3.1 s       1.9 s
fft                 31.4 s      2.8 s          1.7 s      2.1 s      1.1 s       1.1 s
TOTAL               103 s       12.0 s         6.0 s      7.2 s      5.9 s       4.7 s
RATING              1           8.5            16.3       14.8       17.5        21.9
bench.Hg            1663        181/50$^*$     127        135        81          76
bench.PdO                       4000/1129$^*$  2758       2900       1733        1625/450$^*$
                    SGI           SGI           SUN         DEC-SX   DEC-LX
                    Power C.      Origin        USparc 366  ev5/530  ev5/530
lincom-TPP(Mflops)  300           430           290         439      650
matrix-vec(Mflops)  38            100/150       42/65       74/108   67/100
Lincom-TPP          32.0 s        22.0 s        19.7 s      21.8 s   14.3 s
matrix-vec          90.2 s        31.0 s        59 s        40.3 s   48.8 s
fft                 41.0 s        17.0 s        24 s        26.1 s   17.8 s
TOTAL               163 s         70 s          111 s       90 s     81 s
RATING              0.64          1.47          0.9         1.12     1.3
bench.Hg            2200/653$^*$  1200/330$^*$  1660        1424     1140
                    DS20     DS20$^2$  DS20e$^2$  UP2000   UP2000$^2$  UP 1000
                    ev6/500  ev6/500   ev6/666    ev6/666  ev6/666     ev6/600
lincom-TPP(Mflops)  800      1000      1200       1100     1100        800
matrix-vec(Mflops)  135/200  135/200   135/200    170/260  140/200
Lincom-TPP          12.0 s   10.6 s    8.4 s      9.3 s    9.0 s       11.4 s
matrix-vec          19.8 s   20.8 s    17.6 s     17.9 s   17.1 s      30.0 s
fft                 9.8 s    8.6 s     6.7 s      8.5 s    7.7 s       10.9 s
TOTAL               41.4 s   40.0 s    33.7 s     35.7 s   34 s        52 s
RATING              2.4      2.6       3.1        2.8      3.0         2.0
bench.Hg            546      536       385        465      453         786
bench.Hg$^1$        584      564       395        516      485
bench.PdO           10792              8151
                    CRAY T3D$^+$  CRAY T3E$^+$  CRAY T3E$^+$  CRAY    CRAY   VPP
                    ev4           ev5           1200          C90     J90    500
lincom-TPP(Mflops)  96            400           579           800     188    1500
matrix-vec(Mflops)  28/42         101           101           459     50     600
Lincom-TPP          99.5 s        25 s          16.5 s        12.0 s  53 s   7.1 s
matrix-vec          110.0 s       33 s          33 s          8.3 s   74 s   5.0 s
fft                 174.0 s       42 s          34 s          6.9 s   43 s   5.4 s
TOTAL               400 s         100 s         100 s         27.2 s  170 s  17.5 s
RATING              0.25          1.0           1.2           4.1     0.6    6.5
bench.Hg            639$^+$       420$^+$                                    220
LINUX               Xeon GX  Xeon GX  PIII BX  PIII BX  PIII
based PCs           450      550/512  450      500      700c
lincom-TPP(Mflops)  268      378      303      324      500
matrix-vec(Mflops)  70/100   90/120   80/105   90/118   90/118
Lincom-TPP          36 s     27.3 s   34.0 s   32.9 s   29.6 s
matrix-vec          44 s     37.1 s   43.2 s   41.9 s   30.0 s
fft                 27 s     22.4 s   26.6 s   24.6 s   25.1 s
TOTAL               107 s    87 s     104 s    100 s    84 s
RATING              1        1.18     1.0      0.9      0.9
bench.Hg            1631              2000     1866     1789
LINUX$^{**}$        Athlon   Athlon   Athlon   Athlon$^x$  Athlon$^x$  Athlon$^x$
based PCs           550 TB   800 TB   850 TB   850 TB      900         1200
lincom-TPP(Mflops)  700      770      800      850         890         1100
matrix-vec(Mflops)  100/142  115/190  115/190  130/210     120/200     200/300
Lincom-TPP          16.8 s   12.8 s   12.3 s   11.6 s      11.3 s      8.6 s
matrix-vec          30.6 s   26.3 s   25.8 s   22.6 s      24.6 s      18.7 s
fft                 19.5 s   18.7 s   18.0 s   17.3 s      14.0 s      10.9 s
TOTAL               67 s     57.8 s   56 s     51.5 s      50 s        38.3 s
RATING              1.5      1.8      1.8      2.0         2.1         2.5
bench.Hg            1350 s   1131 s   1124 s   1045 s      959 s       818 s
LINUX               Athlon$^i$  Athlon$^i$   Opteron$^j$  Opteron$^k$  Opteron$^k$  Opteron$^p$
based PCs           1400$^b$    XP/1900$^b$  244          246          250          246
                    SDRAM       DDR          32 bit       32 bit       32 bit       64 bit
lincom-TPP(Mflops)  1200        2200         2900         3300         3800         3300
matrix-vec(Mflops)  200/300     230/370      650/850      700/950      750/1050     700/950
Lincom-TPP          5.9 s       4.9 s        3.5 s        3.1 s        2.7 s        3.2 s
matrix-vec          17.3 s      13.1 s       5.4 s        4.3 s        4.2 s        3.9 s
fft                 9.8 s       7.3 s        3.3 s        3.0 s        2.6 s        2.6 s
TOTAL               39.3 s      25.3 s       12.2 s       10.4 s       9.5 s        9.8 s
RATING
bench.Hg            644         455          248          203          177          211
bench.PdO                       8412         4840         4256         3506         4172
LINUX$^{**}$        Ath-64$^k$
based PCs           3700+
                    DDRAM
lincom-TPP(Mflops)  3400
matrix-vec(Mflops)  700/1050
Lincom-TPP          2.9 s
matrix-vec          4.3 s
fft                 2.6 s
TOTAL               9.8 s
RATING
bench.Hg            173
bench.PdO           3550
LINUX               P4$^i$   XEON$^i$  XEON$^j$       XEON$^j$       P4 nrthw$^k$  P4 nrthw$^j$
based PCs           1700     2400      2800           2800           3200          3400
                    RAMBUS   RAMBUS    RAMBUS         DDR            FSB 800       FSB 800
lincom-TPP(Mflops)  2000     3030      4100           4200           4700          5400
matrix-vec(Mflops)  422/555  600/750   566/880        650/950        890/1300      1200/1500
Lincom-TPP          5.5 s    3.5 s     2.6 s          2.5 s          2.3 s         2.0 s
matrix-vec          7.6 s    5.3 s     5.6 s          5.0 s          3.9 s         3.8 s
fft                 7.5 s    4.9 s     3.1 s          2.9 s          2.6 s         2.4 s
TOTAL               20.6 s   13.7 s    11.3 s         10.5 s         8.8 s         8.2 s
RATING              5        7.5       9.4            10             11.7          12.5
bench.Hg            384      298       226/94$^*$     208/85$^*$     175           165
bench.PdO           7600     6335      4790/1801$^*$  4542/1787$^*$  3784          3250
LINUX               P4 pres$^k$   P4 pres$^j$  P4 pres$^k$  P4 940s$^k$  P4 940s$^l$
based PCs           3200          3400         3400         2x3200       2x3200
                    FSB800/DDR1   FSB800/DDR2  FSB800/DDR2  FSB800/DDR2  FSB800/DDR2
lincom-TPP(Mflops)  5200          5200         5200         5500         5500
matrix-vec(Mflops)  1000/1300     1000/1300    1000/1300    1100/1400    1100/1400
Lincom-TPP          2.0 s         2.0 s                     1.9 s        1.9 s
matrix-vec          3.1 s         3.1 s                     2.8 s        2.8 s
fft                 2.0 s         2.0 s                     1.8 s        1.7 s
TOTAL               7.1 s         7.1 s        7.1 s        6.5 s        6.5 s
RATING              14.5          14.5         14.5         16.5         16.5
bench.Hg            148/47$^*$    144          129          129          111
bench.PdO           3224/939$^*$  2850         2580                      2270
$^+$ VASP.4.4, hardware data streaming enabled; bench.Hg was run on 4 nodes, all other data are per node
$^{++}$ system equipped with 2 (first 595$^{++}$ column) or 4 (second 595$^{++}$ column) memory boards
$^*$ second value is for 4 nodes
$^{**}$ all Athlon results use the Atlas-based BLAS (http://www.netlib.org/atlas/)
$^x$ pgf90 -tp athlon, Atlas optimised BLAS for TB, 133 MHz memory
$^1$ benchmark executed twice (on dual processor SMP machines)
$^2$ Tru64; the other Alpha benchmarks were performed under LINUX
$^i$ Intel compiler, ifc, mkl performance lib on P4, Atlas on Athlon
$^A$ VIA KT 266A, other XP benchmarks performed with VIA KT 266
$^j$ Intel compiler, ifc7.1, libgoto_p4_512-r0.6.so or libgoto_p4_1024-r0.96.so on P4 and libgoto_opt32-r0.92.so on Athlon, fftw.3.0.1
$^k$ Intel compiler, ifc7.1, libgoto_p4_1024-r0.96.so on P4 or libgoto_opt32-r0.92.so on Opteron, fftw.3.0.1 and -Duse_cray_ptr
$^l$ ia64, Intel compiler, ifc9.1, libgoto_prescott64p-r1.00.so, fftw.3.1.2 and -Duse_cray_ptr
$^p$ pgi compiler

IMPORTANT: on ALPHA-LINUX the two options

     export MALLOC_MMAP_MAX_=0
     export MALLOC_TRIM_THRESHOLD_=-1
improve the performance by 10-20%!

NOTE: sometimes the tables show very different timings for similar machines with similar clock rates. This is often related to an upgrade of the compiler or of the motherboard.

