Large-scale quantum Monte Carlo electronic structure calculations on the EGEE grid

Antonio Monari, Anthony Scemama, and Michel Caffarel
Laboratoire de Chimie et Physique Quantiques, CNRS-IRSAMC
Université de Toulouse, France.

Proceedings of the 5th International Workshop on Distributed Cooperative Laboratories: Instrumenting the Grid (2010)

A grid implementation of a massively parallel quantum Monte Carlo (QMC) code on the EGEE grid architecture is discussed. Technical details allowing an efficient implementation are presented and the grid performance (number of queued, running, and executed tasks as a function of time) is discussed. Finally, we present a very accurate Li2 potential energy curve obtained by running several hundred tasks simultaneously on the grid.

Introduction

Quantum Monte Carlo (QMC) methods are known to be powerful stochastic approaches for solving the Schroedinger equation.[1] Although they have been widely used in computational physics during the last twenty years, they are still of marginal use in computational chemistry.[2] Two major reasons can be invoked for this: i) the N-body problem encountered in chemistry is particularly challenging (a set of strongly interacting electrons in the field of highly attractive nuclei); ii) the level of numerical accuracy required is very high (the so-called "chemical accuracy"). In computational chemistry, the two standard approaches presently in use are Density Functional Theory (DFT) and the various post-Hartree-Fock wavefunction-based methods (Configuration Interaction, Coupled Cluster, etc.). In practice, DFT methods are the most popular, essentially because they combine reasonable accuracy with a favorable scaling of the computational effort as a function of the number of electrons. Post-HF methods are also employed, since they lead to greater and much better controlled accuracy than DFT. Unfortunately, the price to pay for such accuracy is too high for them to be of practical use for large molecular systems.

QMC appears as a third, promising alternative, essentially because it combines the advantages of both approaches: favorable scaling together with very good accuracy. In addition, and this is the central point of the present note, QMC approaches - in sharp contrast with DFT and post-HF methods - are ideally suited to High-Performance Computing (HPC) and, more specifically, to massively parallel computations, either on homogeneous multi-processor platforms or on heterogeneous grid infrastructures. As with most "classical" or "quantum" Monte Carlo approaches, the algorithm is essentially of the number-crunching type; the central memory requirements remain small and bounded, and the I/O flows are marginal. Given these extremely favorable computational aspects and the rapid evolution of computational infrastructures towards ever more numerous and efficient processors, it is likely that QMC will play a growing role in computational chemistry in the coming years.

In the present study, the first implementation of our quantum Monte Carlo program on a large-scale grid - the European EGEE-III grid[3] - is presented. As a scientific application we have chosen to compute, with very high accuracy, the potential energy curve (PEC) of the Li2 molecule (total energy of the system as a function of the Li-Li distance). To the best of our knowledge, the curve presented here is the most accurate PEC ever published for this system. In order to reach such an accuracy two conditions need to be fulfilled. First, a large enough Monte Carlo statistics has to be accumulated to reduce the final statistical error down to the desired precision. Second, accurate enough trial wavefunctions must be employed to reduce as much as possible the so-called "fixed-node" error (the only systematic error left in a QMC calculation, see Ref.[2]). The first condition is easy to fulfill since the system is small (only six electrons); accumulating statistics is just a matter of making enough Monte Carlo steps and using enough processors (a "brute force" approach). The second condition is much more challenging, since we need to introduce trial wavefunctions with a controlled nodal quality that can be improved in a systematic way. Here, we have achieved this by considering wavefunctions obtained from Full Configuration Interaction (Full CI) calculations in a large basis set (technically, the cc-pVQZ basis set, Ref.[4]). Such an FCI trial wavefunction is expected to have a very good nodal structure. However, there is a price to pay: handling such a function is quite expensive. More precisely, the FCI trial wavefunction used here is expressed as a sum of 16,138 products of two 3x3 determinants (three alpha electrons and three beta electrons) and, at each Monte Carlo step, this wavefunction and its first and second derivatives have to be computed.
Note that the computational cost in terms of CPU time is directly proportional to the number of products in the trial wavefunction expansion. To the best of our knowledge, this is the first time that such a large number of determinants has been used in a QMC calculation. Let us emphasize that it has been possible here only because of the use of the grid infrastructure.

In Sec. \ref{details} some technical details related to the implementation of a quantum Monte Carlo simulation and the use of our QMC=Chem[5] program are presented. Section \ref{strategy} presents the computational strategy employed in our application to the Li2 molecule. Section \ref{performance} gives the results and discusses the performance. Finally, some conclusions are presented in Sec.\ref{conclu}.

Technical details

\label{details}

A walker is a vector $\textbf{X}$ of the 3N-dimensional space containing the entire set of three-dimensional coordinates of the N electrons of the system. During the simulation, a walker (or a population of walkers) samples the 3N-dimensional space via a Markov chain Monte Carlo process, according to some target probability density (the precise density may vary from one QMC method to another). From a practical point of view, the averages of the quantities of interest (energy, densities, etc.) are calculated over as large a set as possible of independent random walks. Random walks differ from each other only in the initial electron positions ${\textbf{X}}_0$ and in the initial random seed S0 determining the entire series of random numbers used.

In the QMC=Chem code used here, the main computational object is a block. In a block, Nwalk independent walkers realize random walks of length Nstep, and the quantities of interest are averaged over all the steps of each random walk. If Nstep is significantly larger than the auto-correlation time (which is usually rather small), the positions of the walkers at the end of the block can be considered as independent of their initial positions and a new block can be sampled using these configurations as ${\textbf{X}}_0$ and using the current random seed as S0.

The final Monte Carlo result is obtained by averaging all the results obtained for each block. If the data associated with each block are saved on disk, the averages can be calculated as a post-processing of the data and the calculation can be easily restarted using the last positions of the walkers and the last random seed.

Note that the computation of the averages does not require any time ordering. If the user provides a set of Nproc different initial conditions (walker positions and random seed), the blocks can be computed in parallel. In figure 1, we give a pictorial representation of four independent processors computing blocks sequentially, each block having different initial conditions.
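The block structure can be sketched compactly. The following toy Python example mimics four independent processors, each chaining blocks whose initial conditions are the final state of the previous block; the random walk and the observable are purely illustrative stand-ins for the actual QMC dynamics:

```python
import random
from statistics import mean

def run_block(x0, seed, n_step):
    """Compute one block: a walk of n_step steps starting from x0,
    returning the block average of a toy observable, plus the final
    position and a fresh seed to initialize the next block."""
    rng = random.Random(seed)
    x, total = x0, 0.0
    for _ in range(n_step):
        x += rng.gauss(0.0, 1.0)   # toy Markov move (not the real QMC move)
        total += x * x             # toy observable accumulated along the walk
    return total / n_step, x, rng.random()

# Four independent "processors", each computing three blocks sequentially;
# only the initial (X0, S0) pairs differ between processors.
initial = [(0.0, s) for s in (11, 22, 33, 44)]
block_averages = []
for x0, seed in initial:
    for _ in range(3):
        avg, x0, seed = run_block(x0, seed, n_step=1000)
        block_averages.append(avg)

# The final Monte Carlo result: average over all blocks, in any order.
final_estimate = mean(block_averages)
```

Because no time ordering is required, the twelve blocks above could equally well have been computed in parallel and averaged afterwards.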

Design of the QMC=Chem program

The QMC=Chem program was designed specifically to run on heterogeneous clusters via the Message Passing Interface (MPI) API[6] and in grid environments via Python[7] scripts. The memory requirements, disk input/outputs and network communications were minimized as much as possible, and the code was written in order to allow asynchronous processes.

The behavior of the program is the following. A main Python program spawns three processes: a data server, a computation engine, and an observer (see figure 2).

The observer

The observer keeps a global view of the whole calculation (current values of the computed averages, total CPU time, wall time, etc.). At regular intervals of time, it updates the results using the calculated blocks and checks whether an "end condition" has been reached, informing the data server whether the code should continue or stop. The end condition can be a maximum value of the total CPU time, the wall time, or the number of blocks, or a threshold on the statistical error bar of any Monte Carlo estimate.
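The observer's stopping test can be sketched as follows; the dictionary keys and limit names are illustrative, not QMC=Chem's actual variables:

```python
import time

def end_condition_reached(stats, limits):
    """Return True when any of the observer's end conditions is met.
    `stats` holds the running totals, `limits` the user-requested bounds
    (both are plain dicts here, for illustration only)."""
    if stats["cpu_time"] >= limits.get("max_cpu_time", float("inf")):
        return True                # total CPU time budget exhausted
    if time.time() - stats["start_wall"] >= limits.get("max_wall_time", float("inf")):
        return True                # wall-time budget exhausted
    if stats["n_blocks"] >= limits.get("max_blocks", float("inf")):
        return True                # requested number of blocks reached
    # statistical error bar below the requested threshold
    return stats["error_bar"] <= limits.get("target_error", 0.0)
```

Any limit left unspecified simply never triggers, so the user can combine the criteria freely.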

The computation engine

The computation engine starts the Fortran MPI executable. Each MPI process computes one block, sends the results to the data server and waits for a reply. The reply of the data server determines if the MPI process should end or if it should start the calculation of another block. The transmission of the data is not realized through MPI, but using a Python script. This latter point avoids the need for synchronization barriers in the Fortran code.

A linear feedback shift register (LFSR) pseudo-random number generator[8] is implemented in the code. A pool of 7000 initial random seeds was prepared beforehand, each random seed being separated from the previous one by $6 \times 10^{11}$ numbers in the sequence. Every time a random number is drawn, a counter is incremented. If the counter reaches $6 \times 10^{11}$, the next free random seed is used. This mechanism guarantees that the parallel processes will never use the same sequence of random numbers.
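The seed-pool mechanism can be sketched as follows. Python's built-in generator stands in for the paper's LFSR, and the small pool and stride are illustrative:

```python
import random

class SeedPool:
    """Non-overlapping random streams: after `stride` draws, jump to the
    next pre-computed seed. The paper uses an LFSR generator, a pool of
    7000 seeds, and a stride of 6 x 10^11; Python's Mersenne Twister and
    tiny numbers are used here for illustration only."""
    def __init__(self, seeds, stride):
        self.seeds = list(seeds)
        self.stride = stride
        self.counter = 0
        self.rng = random.Random(self.seeds.pop(0))

    def draw(self):
        if self.counter >= self.stride:
            # current sub-sequence exhausted: take the next free seed
            self.rng = random.Random(self.seeds.pop(0))
            self.counter = 0
        self.counter += 1
        return self.rng.random()

pool = SeedPool(seeds=[101, 202], stride=3)
values = [pool.draw() for _ in range(5)]   # switches seed after 3 draws
```

Because every process draws from its own disjoint sub-sequence, no two parallel streams can ever coincide.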

The data server

The data server is a Python XML-RPC server whose role is to receive the computed data and save it into files, each file containing the averages computed over a block. It also computes an MD5 key[9] related to the critical input values to guarantee that the computed blocks belong to the simulation, and that the input data has not been corrupted.
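The MD5 fingerprinting of the input can be sketched with the standard library; the particular values hashed below are illustrative, not the actual set of critical inputs used by QMC=Chem:

```python
import hashlib

def input_md5(critical_values):
    """MD5 key over the critical input values. Each block file is tagged
    with this key, so blocks coming from a different simulation, or blocks
    whose input data was corrupted, can be rejected at post-processing."""
    h = hashlib.md5()
    for value in critical_values:
        h.update(repr(value).encode("utf-8"))
    return h.hexdigest()

# Illustrative "critical inputs": molecule, geometry, expansion length.
key = input_md5(["Li2", 3.0, 16138])
```

Any change in the critical inputs yields a different key, so stray blocks are detected immediately.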

Generation of the output

The output file of the program is not created during the run, which only produces block files via the data server. A separate script analyzes the blocks written to disk to produce an output. The consistency of the blocks with the input data is checked using the previously mentioned MD5 key.

A script was written to prepare as many files as there are requested tasks on the grid, each file name containing the index of the task (for example 12.tar.gz). Each file is a gzipped tar archive of a directory containing all the needed input data. The only difference between the input files of two distinct tasks is their vector ${\textbf{X}}_0$ and their random seed S0. The script also generates a gzipped tar file qmcchem.tar.gz which contains a statically linked non-MPI Fortran i686-linux executable and all the needed Python files (the XML-RPC data server, the observer, etc.). A third generated file is a shell script qmcchem_grid.sh which runs the job:
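The preparation step can be sketched as follows; the directory layout and file names inside each archive are assumptions for illustration:

```python
import os
import random
import tarfile
import tempfile

def prepare_tasks(n_tasks, workdir):
    """Create one <index>.tar.gz per requested grid task. The archives
    differ only in the random seed (and, in the real script, also in the
    walker positions X0); everything else is identical."""
    names = []
    for index in range(1, n_tasks + 1):
        task_dir = os.path.join(workdir, str(index))
        os.makedirs(task_dir, exist_ok=True)
        with open(os.path.join(task_dir, "seed"), "w") as f:
            f.write(str(random.getrandbits(64)))      # per-task seed S0
        name = os.path.join(workdir, "%d.tar.gz" % index)
        with tarfile.open(name, "w:gz") as tar:
            tar.add(task_dir, arcname=str(index))     # gzipped tar of the dir
        names.append(name)
    return names

files = prepare_tasks(3, tempfile.mkdtemp())
```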

1. unpack input file and the executables
2. set environment variables for a proper execution of the code
3. run the calculation
4. pack the input files together with the produced blocks (including the last positions of the walkers and the last random seed) into a tar.gz file.

The job submission on the grid is done as a parametric job, where the parameter is the index of the task. The input sandbox contains the qmcchem.tar.gz, qmcchem_grid.sh and _PARAM_.tar.gz files (where _PARAM_ is the index of the task). The output sandbox contains the _PARAM_.tar.gz file, which has been updated if the calculation has run.
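On the gLite middleware underlying EGEE, such a parametric job is described in JDL, with `_PARAM_` replaced by the task index in the sandbox file names. The fragment below is an illustrative sketch only; the attribute values are assumptions, not taken from the paper:

```
JobType        = "Parametric";
Parameters     = 1000;
ParameterStart = 1;
ParameterStep  = 1;
Executable     = "qmcchem_grid.sh";
InputSandbox   = {"qmcchem.tar.gz", "qmcchem_grid.sh", "_PARAM_.tar.gz"};
OutputSandbox  = {"_PARAM_.tar.gz"};
```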

Another script reverses this work. First, it unpacks all the tar.gz files containing the input files and the block files. Then it collects all the block files into a common directory for the production of the output file.

The signals SIGQUIT, SIGALRM, SIGINT, and SIGTERM are intercepted by the QMC=Chem program (SIGKILL and SIGSTOP cannot be caught by a process). If any of these signals is caught, the program tries to finish the current block and terminates in a clean way.
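In Python, such graceful termination amounts to setting a flag in a signal handler and letting the main loop finish its block; a minimal sketch (class and attribute names are illustrative):

```python
import os
import signal

class GracefulStop:
    """Request a clean stop on termination signals: the main loop checks
    `stop_requested` after each block, saves the walkers and the random
    seed, then exits. SIGKILL and SIGSTOP cannot be caught by any
    process, so only the catchable signals are registered."""
    def __init__(self):
        self.stop_requested = False
        for sig in (signal.SIGQUIT, signal.SIGALRM, signal.SIGINT, signal.SIGTERM):
            signal.signal(sig, self._handler)

    def _handler(self, signum, frame):
        self.stop_requested = True   # finish the current block, then stop

guard = GracefulStop()
os.kill(os.getpid(), signal.SIGALRM)   # simulate a batch-system warning
```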

Asynchronous processes

The MPI processes communicate when the program starts in order to broadcast the input data. As long as an MPI process receives from the data server the reply to compute another block, it does not perform any MPI communication with the other processes. When a process receives the order to stop the calculation, it goes to an MPI barrier, sends the current positions of the walkers and the current random seed to the master MPI process, and exits. The main advantage of this design is that if the MPI processes are sent to machines with different types of processors, each process will always use 100% of the CPU (except at the synchronization barrier at the end), and fast processors will simply send more blocks to the data server than slower ones.
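The per-process loop can be sketched as a pure function; in QMC=Chem the `server` object is an XML-RPC proxy and the block computation is the Fortran kernel, so the names and the `store_block` method below are illustrative only:

```python
def worker_loop(server, x0, seed, compute_block):
    """Each process loops independently: compute a block, send it to the
    data server, and let the server's reply decide whether to continue.
    No MPI traffic occurs inside this loop; `server` only needs a
    store_block method (an XML-RPC call in the real code)."""
    while True:
        block_avg, x0, seed = compute_block(x0, seed)
        if server.store_block(block_avg) == "stop":
            break
    # Returned to the MPI master at the final synchronization barrier,
    # so the calculation can be restarted from here later.
    return x0, seed
```

Fast processors simply go around this loop more often than slow ones, which is why no load balancing is needed.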

Analysis of the results

Analyzing the blocks as a post-processing step has major advantages. First, the analysis of the data with graphical user interfaces is trivial, since all the data is present in the block files, which are easy for programs to read, as opposed to traditional output files written for human users. The degree of verbosity of the output can be changed by the user after the end of the calculation; this spares the user from searching a large file for one small piece of information, while the verbose output remains accessible. The last and most important feature is that the production of the output does not impose any synchronization on the parallel processes, so they run naturally in grid environments.

Failure of a process

As all the processes are independent, if one task dies it does not affect the other tasks. For instance, if a large job is sent to the grid and one machine of the grid has a power failure, the user may not even notice that part of the work has not been computed. Moreover, as the process signals are intercepted, if a batch queuing system tries to kill a task (because it has exceeded the maximum wall time of the queue, for example), the job is likely to end gracefully and the computed blocks will be saved. This fail-safe feature is essential in a grid environment, where it can hardly be guaranteed that all the requested tasks will end as expected.

Flexibility

The duration of a block can be tuned by the user by changing the number of steps per trajectory and the end condition can be the wall time, which is convenient in grid environments with multiple types of processors. When the end condition is reached, if the same job is sent again to the queuing system, it will automatically continue using the last walker positions and random seeds, and use the previously computed blocks to calculate the running averages. Note that between two subsequent runs, there is no constraint to use the same number of processors.

Computational Strategy and Details

\label{strategy}

A quantum Monte Carlo study has been performed on the Li2 molecule. The choice of such a system allowed us to consider a Full Configuration Interaction (FCI) wavefunction as the trial wavefunction, i.e. a virtually exact solution of the Schroedinger equation in the subspace spanned by the Gaussian orbital basis. We recall that the computational cost of the FCI problem scales combinatorially with the number of basis functions and electrons; therefore, high-quality FCI wavefunctions are practically impossible to obtain for much larger systems.
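The combinatorial growth is easy to make concrete. Counting Slater determinants before any spin or symmetry adaptation (which, as in the paper, reduces the count considerably), the FCI space size for fixed electron numbers explodes with the number of orbitals:

```python
from math import comb

def fci_space_size(n_orb, n_alpha, n_beta):
    """Number of Slater determinants in an FCI expansion: the alpha and
    beta occupations are chosen independently among the n_orb orbitals.
    Spin and spatial symmetry adaptation reduce this count further."""
    return comb(n_orb, n_alpha) * comb(n_orb, n_beta)

# Growth with basis size for 3 alpha + 3 beta electrons (as in Li2):
sizes = [fci_space_size(n, 3, 3) for n in (10, 20, 40)]
```

Doubling the orbital count multiplies the space by roughly two orders of magnitude here, which is why FCI-quality trial wavefunctions are restricted to small systems.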

The Q5Cost common data format

Due to the inherent heterogeneity of grid architectures, and to the necessity of using different codes, a common format for data interchange and interoperability is mandatory in the context of distributed computation. For this reason we have previously developed a specific data format and library for quantum chemistry[10], and its use for single-processor and distributed calculations has already been reported[11]. The Q5Cost format is based on HDF5, which makes the binary files portable across platforms. Moreover, the compression features of the HDF5 format are exploited to reduce the file size significantly while keeping all the relevant information and metadata. A Q5Cost file contains data related to chemical objects, organized in a hierarchical structure within a logical containment relationship. A library to write and access Q5Cost files has also been released[10]. The library, built on top of the HDF5 API, uses chemical concepts to access the different file objects. This feature makes its inclusion in quantum chemistry codes simple and straightforward, leaving the low-level HDF5 technical details entirely transparent to the chemical software developer. Q5Cost has emerged as an efficient tool to facilitate communication and interoperability; it is particularly useful in distributed environments and therefore well adapted to the grid.

Computational Details

All the preliminary FCI calculations have been performed on a single processor using the Bologna FCI code.[12],[13] The code has been interfaced with Molcas[14] to obtain the necessary one- and two-electron integrals. The FCI computations considered in the present work involved up to 16,138 symmetry-adapted and partially spin-adapted determinants. All the communication between the different codes has been handled using the Q5Cost format and library. In particular, a module has been added to the Molcas code to produce a Q5Cost file containing the information on the molecular system and the atomic and molecular (self-consistent-field level) integrals. The Q5Cost file has been read directly by the FCI code, and the final FCI wavefunction has been added to the same file in a proper and standardized way. The actual QMC=Chem input has been prepared by a Python script reading the Q5Cost file content. Before running the QMC calculation on the grid, an equilibration step was performed (i.e., building "good" starting configurations ${\textbf{X}}_0$ for the walkers) by doing a quick variational QMC run (see Refs.[1] and [2]) in single-processor mode.

QMC computations have been run on the EGEE grid over different computing elements in a massively parallel way. Typically, for each point of the potential energy curve we requested 1000 nodes, obtaining at least about 500 tasks running concurrently. Once the job on each node had completed, the results were retrieved and the output submitted to the post-processing averaging to obtain the final QMC energy. Due to the inherent flexibility of the QMC implementation, the fact that different tasks on different nodes terminate after different numbers of blocks did not cause any difficulty for the post-processing phase. Likewise, the failure or abortion of some tasks did not affect the post-processing either.

Distance (a.u.) Energy (a.u.)
2.2 -14.81854(44)
2.4 -14.85192(36)
2.6 -14.87861(36)
2.8 -14.90269(37)
3.0 -14.92195(35)
3.2 -14.93861(38)
3.4 -14.95324(37)
3.6 -14.96428(38)
3.8 -14.97320(38)
4.0 -14.98250(29)
4.2 -14.98629(27)
4.4 -14.99016(30)
4.6 -14.99285(29)
4.8 -14.99358(29)
5.0 -14.99479(29)
5.051 -14.99492(17)
5.2 -14.99491(29)
5.4 -14.99431(30)
5.6 -14.99279(30)
5.8 -14.99090(30)
6.0 -14.99018(32)
6.4 -14.98541(26)
6.8 -14.98088(29)
7.2 -14.97804(29)
7.6 -14.97281(28)
8.0 -14.96984(28)
10.0 -14.95951(27)
12.0 -14.95747(15)
14.0 -14.95624(12)
16.0 -14.95606(14)
18.0 -14.95569(17)
100.0 -14.95539(11)

Table 1 : Quantum Monte Carlo energies (atomic units) as a function of the Li-Li distance (atomic units). Values in parentheses correspond to the statistical error on the last two digits.

Our results for the potential energy curve of the Li2 molecule are presented in figure 3 (graphical form) and Table 1 (raw data and error bars). Results are given for a set of 31 inter-nuclear distances. Let us emphasize that the data are of very high quality and that the energy curve presented is, to the best of our knowledge, the most accurate ever published for this system. To illustrate this point, the dissociation energy defined as $D_e \equiv E(R_{\rm eq})- E(R=\infty)$ is found here to be De = − 0.0395(2) a.u., in excellent agreement with the experimental result of De = − 0.03928 a.u., Ref.[15]. A much more detailed discussion of these results and additional data will be presented elsewhere[16].
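As a quick consistency check, the quoted De follows directly from two entries of Table 1 (treating the 100 a.u. point as the dissociated limit):

```python
# Table 1 entries (atomic units), copied from the text:
e_eq  = -14.99492    # E(R_eq) at R_eq = 5.051 a.u.
e_inf = -14.95539    # E(R = 100 a.u.), taken as E(R = infinity)

# D_e = E(R_eq) - E(R = infinity) = -0.03953 a.u., i.e. -0.0395(2)
# once the statistical error bars of the two entries are combined.
d_e = e_eq - e_inf
```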

Results and Grid Performance

\label{performance}

For this application, only 11 MB of memory per Fortran process was needed for the computation of the blocks. The benefit of this is twofold. First, it allowed our jobs to be selected rapidly by the batch queuing systems, since very few resources were requested. Second, as the memory requirements are very low, the code was able to run on any kind of machine. This allowed our jobs to enter both the 32-bit (x86) and 64-bit (x86_64) queues.

Grid Performance

A typical analysis of the grid performance is shown in figure 4. Here we report the number of queued, running, and executed tasks as a function of time for the submission of a parametric job of 1000 tasks.

It is also worth commenting on the distribution of the CPU time per block for each task (figure 5); the spread reflects the fact that some tasks were executed on slower nodes. Again, this asymmetry did not influence the final result, since the final statistics were computed over the total set of blocks.

Conclusions

\label{conclu}

An efficient grid implementation of a massively parallel quantum Monte Carlo code has been presented, and the strategy employed has been described in detail. Test applications have been performed on the EGEE-III grid architecture, showing the efficiency and flexibility of our implementation. As shown, our approach makes it possible to exploit the computational power and resources of the grid for the solution of a non-trivial N-body problem of chemistry. It must be emphasized that the very important computational gain obtained here concerns the simulation of the Schroedinger equation for a single fixed nuclear geometry (a single-point energy calculation). This is in sharp contrast with the common situation in computational chemistry, where parallelism is not used (or only partially used) for solving the problem at hand (algorithms are in general poorly parallelized) but rather for running independent simulations at different nuclear geometries (trivial parallelization based on different inputs). We believe that the quantum Monte Carlo approach, based on Markov chain processes and on the accumulation of statistics for independent events, can represent an ideal test bed for the use of grid environments in computational chemistry. Finally, we note that the combination of the computing power of the grid with the ability of QMC to treat chemical problems at a high level of accuracy can open the way to studying fascinating problems (from nano-sciences to biological systems) which are presently out of reach.

Acknowledgments

Support from the French CNRS and from University of Toulouse is gratefully acknowledged. The authors also wish to thank the French PICS action 4263 and European COST in Chemistry D37 GridChem. Acknowledgments are also due to the EGEE-III grid organization and to the CompChem virtual organization.

References

1. W.M.C. Foulkes, L. Mitas, R.J. Needs, and G. Rajagopal, Rev. Mod. Phys. 73, 33 (2001).
2. B.L. Hammond, W.A. Lester Jr., and P.J. Reynolds, Monte Carlo Methods in Ab Initio Quantum Chemistry, World Scientific (1994).
3. http://www.eu-egee.org/
4. T.H. Dunning Jr., J. Chem. Phys. 90, 1007 (1989).
5. QMC=Chem is a general-purpose quantum Monte Carlo code for electronic structure calculations, developed by M. Caffarel, A. Scemama, and collaborators at the Laboratoire de Chimie et Physique Quantiques, CNRS and Université de Toulouse, http://qmcchem.ups-tlse.fr
6. W. Gropp, E. Lusk, N. Doss, and A. Skjellum, Parallel Computing 22, 789-828 (1996).
7. http://www.python.org/
8. P. L'Ecuyer, Math. Comput. 68, 261-269 (1999).
9. R.L. Rivest, "The MD5 Message-Digest Algorithm", Technical report, Internet Activities Board (April 1992).
10. A. Scemama, A. Monari, C. Angeli, S. Borini, S. Evangelisti, and E. Rossi, in O. Gervasi et al. (Eds.), Computational Science and Its Applications - ICCSA 2008, Part I, Lecture Notes in Computer Science 5072, 1094-1107 (2008).
11. V. Vetere, A. Monari, A. Scemama, G.L. Bendazzoli, and S. Evangelisti, J. Chem. Phys. 130, 024301 (2009).
12. G.L. Bendazzoli and S. Evangelisti, J. Chem. Phys. 98, 3141-3150 (1993).
13. L. Gagliardi, G.L. Bendazzoli, and S. Evangelisti, J. Comp. Chem. 18, 1329 (1997).
14. G. Karlström, R. Lindh, P.-Å. Malmqvist, B.O. Roos, U. Ryde, V. Veryazov, P.-O. Widmark, M. Cossi, B. Schimmelpfennig, P. Neogrady, and L. Seijo, Computational Materials Science 28, 222 (2003).
15. C. Filippi and C.J. Umrigar, J. Chem. Phys. 105, 213 (1996).
16. A. Monari, A. Scemama, and M. Caffarel, "A quantum Monte Carlo study of the fixed-node Li2 potential energy curve using Full Configuration Interaction nodes" (unpublished).