VASP currently offers parallelization (and data distribution) over bands and parallelization (and data distribution) over plane wave coefficients (see also Section 4). To obtain high efficiency on massively parallel systems it is strongly recommended to use both at the same time. The only algorithm that works with the distribution over bands is the RMM-DIIS iterative matrix diagonalization (IALGO=48). The conjugate gradient band-by-band method (IALGO=8) is supported only for parallelization over plane wave coefficients.
NPAR tells VASP to switch on parallelization (and data distribution) over bands. NPAR=1 implies distribution over plane wave coefficients only (IALGO=8 and IALGO=48 both work); all nodes then work together on each band. We suggest using this setting only when running on a small number of nodes.
In VASP.4.5, the default for NPAR is equal to the total number of nodes. For NPAR=(total number of nodes), each band is treated by only one node. This can improve the performance on platforms with a small communication bandwidth, but it also increases the memory requirements considerably, because the non-local projector functions must in that case be stored on each node. In addition, a lot of communication is required to orthogonalize the bands. If NPAR is neither 1 nor equal to the number of nodes, the number of nodes working on one band is given by

   (number of nodes working on one band) = (total number of nodes)/NPAR
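The number of nodes working on one band is (total number of nodes)/NPAR. A minimal sketch of this arithmetic (the helper name is illustrative, not part of VASP):

```python
# Sketch: with 1 <= NPAR <= total nodes, the nodes are split into
# NPAR band groups, and each band is handled by one group of
# (total nodes)/NPAR nodes that share its plane wave coefficients.
def nodes_per_band(total_nodes: int, npar: int) -> int:
    if total_nodes % npar != 0:
        raise ValueError("NPAR should divide the total number of nodes")
    return total_nodes // npar

# e.g. 16 nodes with NPAR=4: four groups, four nodes per band
print(nodes_per_band(16, 4))  # -> 4
```

The two limiting cases from the text are recovered: NPAR=1 puts all nodes on each band, and NPAR=(total number of nodes) puts exactly one node on each band.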
The second switch that influences the data distribution is LPLANE. If LPLANE is set to .TRUE. in the INCAR file, the data distribution in real space is done plane-wise. Any combination of NPAR and LPLANE can be used. Generally, LPLANE=.TRUE. reduces the communication bandwidth during the FFTs, but at the same time it unfortunately worsens the load balancing on massively parallel machines. LPLANE=.TRUE. should only be used if NGZ is at least 3*(number of nodes)/NPAR, and optimal load balancing is achieved if NGZ=n*NPAR, where n is an arbitrary integer. If LPLANE=.TRUE. and the real space projector functions (LREAL=.TRUE., ON, or AUTO) are used, it might be necessary to check the lines following

   real space projector functions
    total allocation  :
    max/ min on nodes :

The max/ min values should not differ too much, otherwise the load balancing might worsen as well.
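The two LPLANE guidelines can be condensed into simple checks. A hypothetical helper (not a VASP routine; the function names are illustrative only):

```python
# LPLANE=.TRUE. is only advisable if NGZ >= 3*(number of nodes)/NPAR
def lplane_advisable(ngz: int, n_nodes: int, npar: int) -> bool:
    return ngz >= 3 * n_nodes / npar

# Optimal load balancing is achieved if NGZ = n*NPAR for integer n
def load_balance_optimal(ngz: int, npar: int) -> bool:
    return ngz % npar == 0

# e.g. 16 nodes, NPAR=4, NGZ=48: 48 >= 3*16/4 = 12, and 48 = 12*4
print(lplane_advisable(48, 16, 4), load_balance_optimal(48, 4))
```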
The optimum setting of NPAR and LPLANE depends very much on the type of machine you are running on. Here are a few guidelines:
Usually one is running on a relatively small number of nodes, so that load balancing is no problem. Also, the communication bandwidth is reasonably good on SGI Power Challenge machines. Best performance is often achieved with

   LPLANE = .TRUE.
   NPAR   = 1
   NSIM   = 1

Increasing NPAR usually worsens performance. For NPAR=1 we have in fact observed superlinear scaling with respect to the number of nodes in many cases. This is due to the fact that the cache on the SGI Power Challenge machines is relatively large (4 Mbytes); if the number of nodes is increased, the real space projectors (or reciprocal space projectors) can be kept in the cache, and cache misses therefore decrease significantly.
On the Origin 2000, good performance is obtained with

   LPLANE = .TRUE.
   NPAR   = 4
   NSIM   = 4

Contrary to the SGI Power Challenge, superlinear scaling could not be observed, obviously because data locality and cache reuse are of only minor importance on the Origin 2000.
On a cluster linked by a 100 Mbit network, use

   LPLANE = .TRUE.
   NPAR   = number of nodes
   LSCALU = .FALSE.
   NSIM   = 4

Mind that you need at least a 100 Mbit full duplex network, with a fast switch offering at least 2 Gbit switch capacity.
In summary, the following setting is recommended:

   LPLANE = .FALSE.
   NPAR   = sqrt(number of nodes)
   NSIM   = 1
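The rules of thumb above can be sketched as one function. This is a hypothetical helper, not part of VASP; the machine classification and the decision logic are an illustration of the guidelines only:

```python
import math

# Illustrative only: pick NPAR/LPLANE/NSIM following the guidelines above.
def suggest_settings(n_nodes: int, massively_parallel: bool = False) -> dict:
    if massively_parallel:
        # Summary recommendation: LPLANE=.FALSE., NPAR=sqrt(nodes), NSIM=1
        return {"LPLANE": False, "NPAR": round(math.sqrt(n_nodes)), "NSIM": 1}
    # Small cluster with a slow (100 Mbit) network:
    # LPLANE=.TRUE., NPAR = number of nodes, NSIM=4
    return {"LPLANE": True, "NPAR": n_nodes, "NSIM": 4}

print(suggest_settings(16, massively_parallel=True))
```

Note that NPAR=sqrt(number of nodes) only divides the node count evenly when the number of nodes is a perfect square; otherwise a nearby divisor has to be chosen.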