----------------------------------------------------
Parallelization scheme:
----------------------------------------------------

The references for this algorithm are:
  (i)  Theory: X. Wu, A. Selloni, and R. Car, Phys. Rev. B 79, 085102 (2009).
  (ii) Implementation: H.-Y. Ko, B. Santra, R. A. DiStasio, L. Kong, and R. Car, arxiv.

The parallelization scheme in this algorithm is based on the number of electronic states.
In the current implementation, there are certain restrictions on the choice of the number of
MPI tasks. In addition, slightly different algorithms are employed depending on whether the
number of MPI tasks used in the calculation is greater or less than the number of electronic
states. We strongly recommend that users follow the notes below.
This algorithm is most efficient when the electronic states are distributed evenly over the
MPI tasks. For a system with N electronic states, the optimum numbers of MPI tasks (nproc)
are the following:

  a) In case of nproc <= N, the optimum choices are N/m, where m is any positive integer
     that divides N (so that nproc is an integer).
     Robustness         : Works for both odd and even numbers of electronic states.
     OpenMP threads     : Can be used.
     Taskgroup          : Only the default value of the task group (-ntg 1) is allowed.

  b) In case of nproc  > N, the optimum choices are N*m, where m is any positive integer.
     Robustness         : Can be used for an even number of electronic states.
     Largest value of m : Any m for which nj_max (see output) remains greater than 1. Beyond m=8,
                          however, the scaling may become poor; users should test the scaling.
     OpenMP threads     : Can be used and are highly recommended. We have tested thread counts
                          from 2 to 64; larger counts are also allowed.
                          For very large calculations (nproc > 1000), efficiency can depend strongly
                          on the computer architecture and on the balance between MPI tasks and
                          OpenMP threads, so users should test for the optimal balance.
                          Reasonably good scaling can be achieved with m=6-8 and 2-16 OpenMP
                          threads.
     Taskgroup          : Can be greater than 1; users should choose the largest possible value
                          of ntg. To estimate ntg, find the value of nr3x in the output and take
                          the integer part of nproc/nr3x. We have tested ntg values of the form
                          2^k, where k is a positive integer. Other values of ntg should be
                          used with caution.
     Ndiag              : Run cp.x with the -ndiag X option. Without this option, jobs
                          may crash on certain architectures. Set X to any perfect square
                          that is less than or equal to N.
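
The rules in a) and b) can be turned into a small helper for planning a run. The
Python sketch below is only an illustration and is not part of cp.x: the function
names, the default cap max_m=8 (taken from the scaling remark above; the actual
limit is set by nj_max in the output) and the example value nr3x=128 are
assumptions made here. Case b) is listed only for an even number of states,
following the robustness note.

```python
import math


def optimal_nproc(n_states, max_m=8):
    """Return (case_a, case_b) lists of candidate MPI task counts.

    case_a: nproc <= N, nproc = N/m for every m that divides N.
    case_b: nproc >  N, nproc = N*m for m = 2..max_m (hypothetical cap),
            listed only when N is even.
    """
    if n_states < 1:
        raise ValueError("n_states must be a positive integer")
    case_a = sorted(
        n_states // m for m in range(1, n_states + 1) if n_states % m == 0
    )
    if n_states % 2 == 0:
        case_b = [n_states * m for m in range(2, max_m + 1)]
    else:
        case_b = []
    return case_a, case_b


def estimate_ntg(nproc, nr3x):
    """Integer part of nproc/nr3x, never below the default value of 1."""
    return max(1, nproc // nr3x)


def largest_ndiag(n_states):
    """Largest perfect square that is less than or equal to n_states."""
    return math.isqrt(n_states) ** 2


if __name__ == "__main__":
    case_a, case_b = optimal_nproc(128)
    print("nproc <= N:", case_a)
    print("nproc >  N:", case_b)
    # nr3x must be read from the cp.x output; 128 is a made-up value.
    print("ntg estimate:", estimate_ntg(1024, 128))
    print("ndiag:", largest_ndiag(128))
```

For N = 128 this lists 1, 2, 4, ..., 128 for case a) and 256, 384, ..., 1024 for
case b). The ntg estimate is the raw integer part of nproc/nr3x; since only
powers of 2 have been tested, a value that is not a power of 2 should be used
with caution.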

DEBUG : The EXX calculations always work when the number of MPI tasks equals the number of
        electronic states. In case of any uncertainty, the EXX energy computed with a different
        number of MPI tasks can be checked against a test calculation that uses
        number of MPI tasks = number of electronic states.