The table also shows the timings for the bench.Hg and bench.PdO benchmarks, which are available on the VASP server in the src directory (bench.Hg.tar.gz and bench-PdO.tar.gz). The numbers shown are those written in the "LOOP+" lines of the OUTCAR file (type: grep 'LOOP+' OUTCAR).
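As a minimal sketch of how the "LOOP+" numbers could be collected, the helper below prints the last numeric field of every LOOP+ line and then their sum. The function name loop_times is hypothetical, and the sketch assumes that each LOOP+ line ends with a timing in seconds; the exact layout of the line may differ between VASP versions, so check the output against the raw grep result.

```shell
# Hypothetical helper (not part of VASP): list the timing at the end of
# each LOOP+ line of an OUTCAR-style file, then print the sum.
# Assumption: the last whitespace-separated field of the line is a time in s.
loop_times() {
    grep 'LOOP+' "$1" | awk '
        { t = $NF; sum += t; printf "%.2f\n", t }
        END { printf "sum %.2f\n", sum }'
}

# Usage: loop_times OUTCAR
```

Whether the per-line value or the sum corresponds to the figure quoted in the table depends on how many LOOP+ lines the benchmark run produces.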
You can test your own machine by compiling ffttest and dgemmtest in
the VASP.4.X (X3) directory, and typing

  dgemmtest <lincom.table
  dgemmtest <rpro.table
  ffttest

This will execute the tests "Lincom-TPP", "matrix-vec" and "fft" in this order (serial version only). Note that with the present algorithms the matrix-vector part is less important than the synthetic mix of "Lincom-TPP", "matrix-vec" and "fft" suggests. In addition, for the bench.Hg benchmark, the performance of the matrix-matrix part plays a more significant role than in the synthetic benchmark.
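To place your own machine in the tables, the three timings can be combined into a TOTAL and a RATING. The sketch below rests on an inference from the table values, not on a definition given in the text: in most columns TOTAL is close to the sum of the three test timings, and RATING is close to the 103 s TOTAL of the IBM RS6000 590 divided by the machine's TOTAL. The function name rate_machine is hypothetical.

```shell
# Hypothetical helper: sum the Lincom-TPP, matrix-vec and fft timings
# (in seconds) and rate the result against the IBM RS6000 590 TOTAL of
# 103 s. The formula RATING = 103 / TOTAL is inferred from the tables.
rate_machine() {
    awk -v a="$1" -v b="$2" -v c="$3" -v ref=103 'BEGIN {
        total = a + b + c
        printf "TOTAL %.1f s  RATING %.1f\n", total, ref / total
    }'
}

# Example with the timings of the third IBM RS6000 column (595):
#   rate_machine 25.0 32.3 24.0
# prints "TOTAL 81.3 s  RATING 1.3"
```

Some columns in the tables deviate from this relation by a few percent, so treat the computed rating as an approximate guide only.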
Currently, all high performance machines run VASP fairly well. The cheapest options (best value at lowest price) are presently AMD Athlon-64 and Intel P4 based PCs. For compilation we recommend the ifc compiler. Which processor (clock speed) to buy depends a little on the budget and the available space. If you need a high packing density, dual Opteron machines are a good option. IBM Power 4 based machines and Intel Itanium systems (SGI Altix, HP-UX) remain competitive, but at a somewhat higher price than PCs.
IBM RS6000 | IBM RS6000 | IBM RS6000 | IBM RS6000 | IBM RS6000 | IBM SP3 | |
590 | 3CT | 595 | 595 | 397 | High Node | |
lincom-TPP(Mflops) | 245 | 237 | 389 | 389 | 580 | 1220 |
matrix-vec(Mflops) | 110 | 73/128 | 110 | 110 | 300 | 300/400 |
Lincom-TPP | 40.6 s | 42.7 s | 25.0 s | 21.4 s | 17.8 s | 8.4 s |
matrix-vec | 32.3 s | 40.4 s | 32.3 s | 19.4 s | 15.3 s | 12.1 s |
fft | 31.4 s | 35.0 s | 24.0 s | 17.3 s | 14.4 s | 5.1 s |
TOTAL | 103 s | 117 s | 81.3 s | 58.3 s | 47.5 s | 26.8 s |
RATING | 1 | 0.9 | 1.3 | 1.8 | 2.2 | 3.8 |
bench.Hg | 1663 | 1920 | 1380 | 1000 | 809 | 356 |
IBM RS6000 | IBM SP4 | ITANIUM 2 | ITANIUM 2 | Altix 350 | Altix 3700 Bx2 | |
590 | 1300 | 1300 | 1600 | 1600 | ||
HP-UX | LINUX | SUSE SLES9 | SUSE SLES 9 | |||
lincom-TPP(Mflops) | 245 | 3100 | 5000 | 4300 | 5932 | 6129 |
matrix-vec(Mflops) | 110 | 600/800 | 1200/2300 | 1200/1500 | 1378/2021 | 2671/3135 |
Lincom-TPP | 40.6 s | 3.2 s | 2.0 s | 2.3 s | 1.7 s | 1.7 s |
matrix-vec | 32.3 s | 6.0 s | 2.3 s | 2.6 s | 3.1 s | 1.9 s |
fft | 31.4 s | 2.8 s | 1.7 s | 2.1 s | 1.1 s | 1.1 s |
TOTAL | 103 s | 12.0 s | 6.0 s | 7.2 s | 5.9 s | 4.7 s |
RATING | 1 | 8.5 | 16.3 | 14.8 | 17.5 | 21.9 |
bench.Hg | 1663 | 181/50 | 127 | 135 | 81 | 76 |
bench.PdO | 4000/1129 | 2758 | 2900 | 1733 | 1625/450 | |
SGI | SGI | SUN | DEC-SX | DEC-LX | ||
Power C. | Origin | USparc 366 | ev5/530 | ev5/530 | ||
lincom-TPP(Mflops) | 300 | 430 | 290 | 439 | 650 | |
matrix-vec(Mflops) | 38 | 100/150 | 42/65 | 74/108 | 67/100 | |
Lincom-TPP | 32.0 s | 22.0 s | 19.7 s | 21.8 s | 14.3 s | |
matrix-vec | 90.2 s | 31.0 s | 59 s | 40.3 s | 48.8 s | |
fft | 41.0 s | 17.0 s | 24 s | 26.1 s | 17.8 s | |
TOTAL | 163 s | 70 s | 111 s | 90 s | 81 s | |
RATING | 0.64 | 1.47 | 0.9 | 1.12 | 1.3 | |
bench.Hg | 2200/653 | 1200/330 | 1660 | 1424 | 1140 | |
DS20 | DS20 | DS20e | UP2000 | UP2000 | UP 1000 | |
ev6/500 | ev6/500 | ev6/666 | ev6/666 | ev6/666 | ev6/600 | |
lincom-TPP(Mflops) | 800 | 1000 | 1200 | 1100 | 1100 | 800 |
matrix-vec(Mflops) | 135/200 | 135/200 | 135/200 | 170/260 | 140/200 | |
Lincom-TPP | 12.0 s | 10.6 s | 8.4 s | 9.3 s | 9.0 s | 11.4 s |
matrix-vec | 19.8 s | 20.8 s | 17.6 s | 17.9 s | 17.1 s | 30.0 s |
fft | 9.8 s | 8.6 s | 6.7 s | 8.5 s | 7.7 s | 10.9 s |
TOTAL | 41.4 s | 40.0 s | 33.7 s | 35.7 s | 34 s | 52 s |
RATING | 2.4 | 2.6 | 3.1 | 2.8 | 3.0 | 2.0 |
bench.Hg | 546 | 536 | 385 | 465 | 453 | 786 |
bench.Hg | 584 | 564 | 395 | 516 | 485 | |
bench.PdO | 10792 | 8151 | ||||
CRAY T3D | CRAY T3E | CRAY T3E | CRAY | CRAY | VPP | |
ev4 | ev5 | 1200 | C90 | J90 | 500 | |
lincom-TPP(Mflops) | 96 | 400 | 579 | 800 | 188 | 1500 |
matrix-vec(Mflops) | 28/42 | 101 | 101 | 459 | 50 | 600 |
Lincom-TPP | 99.5 s | 25 s | 16.5 s | 12.0 s | 53 s | 7.1 s |
matrix-vec | 110.0 s | 33 s | 33 s | 8.3 s | 74 s | 5.0 s |
fft | 174.0 s | 42 s | 34 s | 6.9 s | 43 s | 5.4 s |
TOTAL | 400 s | 100 s | 100 s | 27.2 s | 170 s | 17.5 s |
RATING | 0.25 | 1.0 | 1.2 | 4.1 | 0.6 | 6.5 |
bench.Hg | 639 | 420 | 220 |
LINUX | Xeon GX | Xeon GX | PIII BX | PIII BX | PIII | |
based PC's | 450 | 550/512 | 450 | 500 | 700c | |
lincom-TPP(Mflops) | 268 | 378 | 303 | 324 | 500 | |
matrix-vec(Mflops) | 70/100 | 90/120 | 80/105 | 90/118 | 90/118 | |
Lincom-TPP | 36 s | 27.3 s | 34.0 s | 32.9 s | 29.6 s | |
matrix-vec | 44 s | 37.1 s | 43.2 s | 41.9 s | 30.0 s | |
fft | 27 s | 22.4 s | 26.6 s | 24.6 s | 25.1 s | |
TOTAL | 107 s | 87 s | 104 s | 100 s | 84 s | |
RATING | 1 | 1.18 | 1.0 | 0.9 | 0.9 | |
bench.Hg | 1631 | 2000 | 1866 | 1789 | ||
LINUX | Athlon | Athlon | Athlon | Athlon | Athlon | Athlon |
based PC's | 550 | TB 800 | TB 850 | TB 850 | TB 900 | 1200 |
lincom-TPP(Mflops) | 700 | 770 | 800 | 850 | 890 | 1100 |
matrix-vec(Mflops) | 100/142 | 115/190 | 115/190 | 130/210 | 120/200 | 200/300 |
Lincom-TPP | 16.8 s | 12.8 s | 12.3 s | 11.6 s | 11.3 s | 8.6 s |
matrix-vec | 30.6 s | 26.3 s | 25.8 s | 22.6 s | 24.6 s | 18.7 s |
fft | 19.5 s | 18.7 s | 18.0 s | 17.3 s | 14.0 s | 10.9 s |
TOTAL | 67 s | 57.8 s | 56 s | 51.5 s | 50 s | 38.3 s |
RATING | 1.5 | 1.8 | 1.8 | 2.0 | 2.1 | 2.5 |
bench.Hg | 1350 s | 1131 s | 1124 s | 1045 s | 959 s | 818 s |
LINUX | Athlon | Athlon | Opteron | Opteron | Opteron | Opteron |
based PC's | 1400 | XP/1900 | 244 | 246 | 250 | 246 |
SDRAM | DDR | 32 bit | 32 bit | 32 bit | 64 bit | |
lincom-TPP(Mflops) | 1200 | 2200 | 2900 | 3300 | 3800 | 3300 |
matrix-vec(Mflops) | 200/300 | 230/370 | 650/850 | 700/950 | 750/1050 | 700/950 |
Lincom-TPP | 5.9 s | 4.9 s | 3.5 s | 3.1 s | 2.7 s | 3.2 s |
matrix-vec | 17.3 s | 13.1 s | 5.4 s | 4.3 s | 4.2 s | 3.9 s |
fft | 9.8 s | 7.3 s | 3.3 s | 3.0 s | 2.6 s | 2.6 s |
TOTAL | 39.3 s | 25.3 s | 12.2 s | 10.4 s | 9.5 s | 9.8 s |
RATING | ||||||
bench.Hg | 644 | 455 | 248 | 203 | 177 | 211 |
bench.PdO | 8412 | 4840 | 4256 | 3506 | 4172 | |
LINUX | Ath-64 | |||||
based PC's | 3700+ | |||||
DDRAM | ||||||
lincom-TPP(Mflops) | 3400 | |||||
matrix-vec(Mflops) | 700/1050 | |||||
Lincom-TPP | 2.9 s | |||||
matrix-vec | 4.3 s | |||||
fft | 2.6 s | |||||
TOTAL | 9.8 s | |||||
RATING | ||||||
bench.Hg | 173 | |||||
bench.PdO | 3550 |
LINUX | P4 | XEON | XEON | XEON | P4 nrthw | P4 nrthw |
based PC's | 1700 | 2400 | 2800 | 2800 | 3200 | 3400 |
RAMBUS | RAMBUS | RAMBUS | DDR | FSB 800 | FSB 800 | |
lincom-TPP(Mflops) | 2000 | 3030 | 4100 | 4200 | 4700 | 5400 |
matrix-vec(Mflops) | 422/555 | 600/750 | 566/880 | 650/950 | 890/1300 | 1200/1500 |
Lincom-TPP | 5.5 s | 3.5 s | 2.6 s | 2.5 s | 2.3 s | 2.0 s |
matrix-vec | 7.6 s | 5.3 s | 5.6 s | 5.0 s | 3.9 s | 3.8 s |
fft | 7.5 s | 4.9 s | 3.1 s | 2.9 s | 2.6 s | 2.4 s |
TOTAL | 20.6 s | 13.7 s | 11.3 s | 10.5 s | 8.8 s | 8.2 s |
RATING | 5 | 7.5 | 9.4 | 10 | 11.7 | 12.5 |
bench.Hg | 384 | 298 | 226/94 | 208/85 | 175 | 165 |
bench.PdO | 7600 | 6335 | 4790/1801 | 4542/1787 | 3784 | 3250 |
LINUX | P4 pres | P4 pres | P4 pres | P4 940s | P4 940s | |
based PC's | 3200 | 3400 | 3400 | 2x3200 | 2x3200 | |
FSB800/DDR1 | FSB800/DDR2 | FSB800/DDR2 | FSB800/DDR2 | FSB800/DDR2 | ||
lincom-TPP(Mflops) | 5200 | 5200 | 5200 | 5500 | 5500 | |
matrix-vec(Mflops) | 1000/1300 | 1000/1300 | 1000/1300 | 1100/1400 | 1100/1400 | |
Lincom-TPP | 2.0 s | 2.0 s | 1.9 s | 1.9 s | ||
matrix-vec | 3.1 s | 3.1 s | 2.8 s | 2.8 s | ||
fft | 2.0 s | 2.0 s | 1.8 s | 1.7 s | ||
TOTAL | 7.1 s | 7.1 s | 7.1 s | 6.5 s | 6.5 s | |
RATING | 14.5 | 14.5 | 14.5 | 16.5 | 16.5 | |
bench.Hg | 148/47 | 144 | 129 | 129 | 111 | |
bench.PdO | 3224/939 | 2850 | 2580 | 2270 |
The environment settings

  export MALLOC_MMAP_MAX_=0
  export MALLOC_TRIM_THRESHOLD_=-1

improve the performance by 10-20%.

NOTE: Sometimes the tables show very different timings for similar machines with similar clock rates. This is often related to an upgrade of the compiler or of the motherboard.
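To compare timings with and without the two malloc settings, it can be convenient to apply them to a single command only, instead of exporting them for the whole session. The wrapper below is a minimal sketch; the function name run_with_malloc_tuning is hypothetical, and the executable name in the usage comment is an assumption about your installation.

```shell
# Hypothetical wrapper: run one command with the two malloc settings
# applied to that command only. The calling shell's environment is
# left unchanged because env sets the variables for the child process.
run_with_malloc_tuning() {
    env MALLOC_MMAP_MAX_=0 MALLOC_TRIM_THRESHOLD_=-1 "$@"
}

# Usage (executable name is an assumption):
#   run_with_malloc_tuning ./vasp
```

Running the same benchmark once directly and once through the wrapper, and comparing the "LOOP+" timings, shows whether the settings help on a given machine.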