The atmospheric concentrations of weakly bound molecular clusters can be computed from the thermochemical properties of low energy structures found through a multi-step configurational sampling methodology utilizing a genetic algorithm and semi-empirical and ab initio quantum chemistry.
The computational study of the formation and growth of atmospheric aerosols requires an accurate Gibbs free energy surface, which can be obtained from gas phase electronic structure and vibrational frequency calculations. These quantities are valid for those atmospheric clusters whose geometries correspond to a minimum on their potential energy surfaces. The Gibbs free energy of the minimum energy structure can be used to predict atmospheric concentrations of the cluster under a variety of conditions such as temperature and pressure. We present a computationally inexpensive procedure built on a genetic algorithm-based configurational sampling followed by a series of increasingly accurate screening calculations. The procedure starts by generating and evolving the geometries of a large set of configurations using semi-empirical models then refines the resulting unique structures at a series of high-level ab initio levels of theory. Finally, thermodynamic corrections are computed for the resulting set of minimum-energy structures and used to compute the Gibbs free energies of formation, equilibrium constants, and atmospheric concentrations. We present the application of this procedure to the study of hydrated glycine clusters under ambient conditions.
The most uncertain parameter in atmospheric studies of climate change is the exact extent to which cloud particles reflect incoming solar radiation. Aerosols, which are particulate matter suspended in a gas, form cloud particles called cloud condensation nuclei (CCN) that scatter incoming radiation, thus preventing its absorption and the subsequent heating of the atmosphere1. A detailed understanding of this net cooling effect requires an understanding of the growth of aerosols into CCNs, which in turn requires an understanding of the growth of small molecular clusters into aerosol particles. Recent work has suggested that aerosol formation is initiated by molecular clusters of 3 nm in diameter or less2; however, this size regime is difficult to access using experimental techniques3,4. Therefore, a computational modeling approach is desired in order to overcome this experimental limitation.
Using our modeling approach described below, we can analyze the growth of any hydrated cluster. Because we are interested in the role of water in the formation of large biological molecules from smaller constituents in pre-biotic environments, we illustrate our approach with glycine. The challenges encountered and tools needed to address those research questions are very similar to those involved in the study of atmospheric aerosols and prenucleation clusters5,6,7,8,9,10,11,12,13,14,15. Here, we examine hydrated glycine clusters starting from an isolated glycine molecule followed by a series of stepwise additions of up to five water molecules. The final goal is to calculate the equilibrium concentrations of Gly(H2O)n=0-5 clusters in the atmosphere at room temperature at sea-level and a relative humidity (RH) of 100 %.
A small number of these sub-nanometer molecular clusters grow into a metastable critical cluster (1-3 nm in diameter) either by adding other vapor molecules or coagulating on existing clusters. These critical clusters have a favorable growth profile leading to the formation of much larger (up to 50-100 nm) cloud condensation nuclei (CCN), which directly affect the precipitation efficiency of clouds as well their ability to reflect incident light. Therefore, having a good understanding of the thermodynamics of molecular clusters and their equilibrium distributions should lead to more accurate predictions of the impact of aerosols on the global climate.
A descriptive model of aerosol formation requires accurate thermodynamics of molecular cluster formation. The computation of accurate thermodynamics of molecular cluster formation requires the identification of the most stable configurations, which involves finding the global and local minima on the cluster's potential energy surface (PES)16. This process is called configurational sampling and can be achieved through a variety of techniques, including those based on molecular dynamics (MD)17,18,19,20, Monte Carlo (MC)21,22, and genetic algorithms (GA)23,24,25.
Different protocols have been developed over the years to obtain the structure and thermodynamics of atmospheric hydrates at a high level of theory. These protocols differed in the choice of (i) configurational sampling method, (ii) nature of low-level method used in the configurational sampling, and (iii) the hierarchy of higher-level methods used to refine the results in the subsequent steps.
The configurational sampling methods included chemical intuition26, random sampling27,28, molecular dynamics (MD)29,30, basin hopping (BH)31, and genetic algorithm (GA)24,25,32. The most common low-level methods employed with these sampling methods are force fields or semi-empirical models such as PM6, PM7 and SCC-DFTB. These are often followed by DFT calculations with increasingly larger basis sets and more reliable functionals from the higher rungs of Jacob's ladder33. In some cases, these are followed by higher level wavefunction methods such as MP2, CCSD(T), and the cost efficient DLPNO-CCSD(T)34,35.
Kildgaard et al.36 developed a systematic method where water molecules are added at points on the Fibonacci spheres37 around smaller hydrated or unhydrated clusters to generate candidates for larger clusters. Unphysical and redundant candidates are removed based on close contact thresholds and root-mean-square distance between different conformers. Subsequent optimizations using the PM6 semi-empirical method and a hierarchy of DFT and wavefunction methods are used to get a set of low energy conformers at a high level of theory.
The artificial bee colony (ABC) algorithm38 is a new configurational sampling approach that has recently been implemented by Zhang et al. to study molecular clusters in a program called ABCluster39. Kubecka et al.40 used ABCluster for configurational sampling followed by low-level reoptimizations using the tight-binding GFN-xTB semi-empirical method41. They further refined the structures and energies using DFT methods followed by final energies using DLPNO-CCSD(T).
Regardless of the method, configurational sampling starts with a randomly- or nonrandomly-generated distribution of points on the PES. Each point corresponds to a specific geometry of the molecular cluster in question and is generated by the sampling method. Then the closest local minimum is found for each point by following the "downhill" direction on the PES. The set of minima thus found correspond to those geometries of the molecular cluster that are stable, at least for some time. Here, the shape of the PES and the evaluation of the energy at each point on the surface will be sensitive to the physical description of the system where a more accurate physical description results in a more computationally expensive energy calculation. We will specifically use the GA method implemented in the OGOLEM25 program, which has been successfully applied to a variety of global optimization and configurational sampling problems42,43,44,45, to generate the initial set of sampling points. The PES will be described by the PM7 model46 implemented in the MOPAC2016 program47. This combination is employed because it generates a larger variety of points compared to the MD and MC methods and finds the local minima faster than more-detailed descriptions of the PES.
The set of GA-optimized local minima are taken as the starting geometries for a series of screening steps, which lead to a set of low-lying minimum energy. This part of the protocol begins by optimizing the set of unique GA-optimized structures using density-functional theory (DFT) with a small basis set. This set of optimizations will generally give a smaller set of unique local minimum structures which are modeled in more detail compared to the GA-optimized semi-empirical structures. Then another round of DFT optimizations are performed on this smaller set of structures using a larger basis set. Again, this step will generally give a smaller set of unique structures which are modeled in more detail compared to the small basis DFT step. The final set of unique structures are then optimized to a tighter convergence and the harmonic vibrational frequencies are calculated. After this step we have everything we need to compute the equilibrium concentrations of the clusters in the atmosphere. The overall approach is summarized diagrammatically in Figure 1. We will use the PW9148 generalized-gradient approximation (GGA) exchange-correlation functional in the Gaussian0949 implementation of DFT along with two variations of the Pople50 basis set (6-31+G* for the small basis step and 6-311++G** for the large basis step). This particular combination of exchange-correlation functional and basis set was chosen due to its previous success in computing accurate Gibbs free energies of formation for atmospheric clusters51,52.
This protocol assumes that the user has access to a high-performance computing (HPC) cluster with the portable batch system53 (PBS), MOPAC2016 (http://openmopac.net/MOPAC2016.html)47, OGOLEM (https://www.ogolem.org)25, Gaussian 09 (https://gaussian.com)49, and OpenBabel54 (http://openbabel.org/wiki/Main_Page) software installed following their specific installation instructions. Each step in this protocol also uses a set of in-house shell and Python 2.7 scripts which must be saved to a directory that is included in the user's $PATH environmental variable. All necessary environmental modules and execution permissions to run all of the above programs must also be loaded into the user's session. The disk and memory usage by the GA code (OGOLEM) and semi-empirical codes (MOPAC) are very small by modern computer resource standards. The overall memory and disk usage for OGOLEM/MOPAC depends on how many threads one wants to use, and even then, the resource usage will be small compared to the capabilities of most HPC systems. The resource needs of the QM methods depend on the size of the clusters and the level of theory used. The advantage of using this protocol is that one can vary the level of theory to be able to calculate the final set of low energy structures, keeping in mind that usually faster calculations lead to more uncertainty in accuracy of the results.
For the sake of clarity, the user's local computer will be referred to as "local computer" while the HPC cluster they have access to will be referred to as "remote cluster".
1. Finding the minimum energy structure of isolated glycine and water
NOTE: The goal here is two-fold: (i) to obtain minimum energy structures of isolated water and glycine molecules for use in the genetic algorithm configurational sampling, (ii) and to compute the thermodynamic corrections to the gas phase energies of these molecules for use in the calculation of atmospheric concentrations.
2. Genetic-algorithm-based configurational sampling of Gly(H2O)n=1-5 clusters
NOTE: The goal here is to obtain a set of low-energy structures for Gly(H2O)n=1-5 at the inexpensive semi-empirical level of theory, using the PM746 model implemented in MOPAC47. It is imperative that the working directory has the exact organization and structure as shown in Figure 2. This is to ensure that the custom shell and Python scripts work without failures.
3. Refinement using QM method with a small basis set
NOTE: The goal here is to refine the configurational sampling of the Gly(H2O)n=1-5 clusters using a better quantum-mechanical description to obtain a smaller but more accurate set of Gly(H2O)n=1-5 cluster structures. The starting structures for this step are the outputs of Step 2.
4. Further refinement using QM method with a large basis set
NOTE: The goal here is to further refine the configurational sampling of the Gly(H2O)n=1-5 clusters using a better quantum-mechanical description. The starting structures for this step are the outputs of Step 3.
5. Final Energy and Thermodynamic Correction Calculations
NOTE: The goal here is to obtain the vibrational structure and energies of the Gly(H2O)n=1-5 clusters using a large basis set and an ultrafine integration grid in order to compute the desired thermochemical corrections.
6. Computing atmospheric concentrations of Gly(H2O)n=0-5 clusters at room temperature at sea-level
NOTE: This is accomplished by first copying the thermodynamic data generated in the previous step into a spreadsheet and calculating the Gibbs free energies of sequential hydration. Then, the Gibbs free energies are used to calculate equilibrium constants for each sequential hydration. Finally, a set of linear equations are solved to get the equilibrium concentration of the hydrates for a given concentration of monomers, temperature and pressure.
The first set of results from this protocol should be a set of low-energy structures of Gly(H2O)n=1-5 found through the configurational sampling procedure. These structures have been optimized at the PW91/6-311++G** level of theory and are assumed to be accurate for the purpose of this paper. There is no evidence to suggest that PW91/6-311++G** consistently underestimates or overestimates the binding energy of these clusters. Its ability to predict binding energies relative to MP2/CBS32 and [DLPNO-]CCSD(T)/CBS60,61 estimates and experiment52 shows a lot of fluctuations. The same is true of most other density functionals. Generally, each value of n = 1 – 5 should yield a handful of low-energy structures within around 5 kcal mol-1 of the lowest-energy structure. Here, we focus on the first structure produced by the run-thermo-pw91.csh script for brevity. Figure 3 shows the lowest electronic energy isomers of Gly(H2O)n=0-5 clusters. One can see that the hydrogen bond network grows in complexity as the number of water molecules increases, and even goes from a mostly planar network to a three-dimensional cage-like structure at n = 5. The rest of this text uses the energies and thermodynamic quantities corresponding to these five specific clusters.
Table 1 contains the thermodynamic quantities necessary to carry out the protocol. Table 2 shows an example of the output of the run-thermo-pw91.csh script where the electronic energies, vibrational zero-point corrections, and the thermodynamic corrections at three different temperatures are printed. For each cluster (row), E[PW91/6-311++G**] corresponds to the gas phase electronic energies at the PW91/6-311++G** level of theory calculated on ultrafine integration grids in units of Hartree, as well as the zero-point vibrational energy (ZPVE) in units of kcal mol-1. At each temperature, 216.65 K, 273.15 K, and 298.15 K, the thermodynamic corrections are listed, ∆H the enthalpy of formation in units of kcal mol-1, S the entropy of formation in units of cal mol-1, and ∆G the Gibbs free energy of formation in units of kcal mol-1. Table 3 shows an example computation of the total Gibbs free energy change of hydration, as well as for sequential hydration. An example computation of the total Gibbs free energy change of hydration for the reaction
starts with the computation of the electronic energy EPW91 as
where EPW91[Gly∙(H2O)] is taken from Table 2 column C, and EPW91[Gly] and EPW91[H2O] are taken from Table 1 column B. Next we calculate the total gas phase energy change ΔE(0) by including the change in the zero-point vibrational energy of the reaction as
to obtain column D. Here, ΔEPW91/6−311++G** is taken from Table 3 column C, EZPVE[Gly ∙ (H2O)] from Table 2 column D, and EZPVE[Gly] and EZPVE[H2O] from Table 1 column C. For the sake of brevity, we will move on to room temperature clusters, so we skip over the 216.65 K and 273.15 K data. At room temperature, we then calculate the enthalpy change of the reaction ΔH by correcting the gas phase energy change as
where ΔE(0) is taken from Table 3 column D, ΔH[Gly∙(H2O)] is taken from Table 2 column K, and ΔH[Gly] and ΔH[H2O] are taken from Table 1 column J. Finally, we calculate the Gibbs free energy change of the reaction ΔG as
where ΔH is taken from Table 3 column I, S[Gly∙(H2O)] is taken from Table 2 column L, and S[Gly] and S[H2O] are taken from Table 1 column K. Note here that the entropy values must be converted to units of kcal mol-1 K-1 during this step.
We now have the necessary quantities to compute the atmospheric concentrations of hydrated glycine as shown in Step 6. The results should resemble the data shown in Table 4, but small numerical differences are to be expected. Table 4 shows the equilibrium hydrate concentrations found from the formulation of the system of six equations in Step 6.2 into one matrix equation and its subsequent solution. We start by acknowledging the fact that the system of equations can be written as
where Kn is the equilibrium constant for the nth sequential hydration of glycine, w is the concentration of water in the atmosphere, g is the initial concentration of isolated glycine in the atmosphere, and gn is the equilibrium concentration of Gly(H2O)n. If we rewrite the above equation as Ax = b, we get x = A−1b where A−1 is the inverse of matrix A. This inverse can be easily computed using built-in spreadsheet functions as shown in Table 4 to obtain the final results.
Figure 4 shows the equilibrium concentration of hydrated glycine calculated in Table 4 as a function of temperature at 100% relative humidity and 1 atmosphere pressure. It shows that, as temperature decreases from 298.15K to 216.65K, the concentration of unhydrated glycine (n=0) decreases and those of hydrated glycine increases. The glycine dihydrate (n=2) in particular increases dramatically with decreasing temperature while the change in the concentration of other hydrates is less noticeable. These inverse correlation between temperature and hydrate concentration is consistent with the expectation that lower Gibbs free energies of hydrations at lower temperatures favor the formation of hydrates.
Figure 5 illustrates the relative humidity dependence of equilibrium concentration of glycine hydrates at 298.15K and 1 atmosphere pressure. It clearly demonstrates that as RH increases from 20% to 100%, the concentration of hydrates (n>0) increase at the expense of unhydrated glycine (n=0). Once again the direct correlation between the relative humidity and concentration of hydrates is consistent with the idea that the presence of more water molecules at higher RH promotes the formation of hydrates.
As presented, this protocol gives a qualitative understanding of the hydrated glycine populations in the atmosphere. Assuming an initial concentration of isolated glycine of 2.9 million molecules per cubic centimeter, we see that the unhydrated glycine (n=0) is the most abundant species under most conditions except T=216.65K and RH=100%. The dihydrate (n=2), which has the lowest sequential Gibbs free energy of hydration at all three temperatures, is the most abundant hydrate at the conditions considered here. The monohydrate (n=1) and larger hydrates (n≥3) are predicted to be found in negligible amounts. Upon inspection of Figure 3, the abundance of the n = 1–4 clusters can be related to the stability and strain in the hydrogen bond network of the clusters. These clusters have the water molecules hydrogen bonded to the carboxylic acid moiety of glycine in a geometry closely resembling those of various hydrogen-bonded ring structures, making them especially stable.
Figure 1: Schematic description of the current procedure. A large pool of guess structures generated by the genetic algorithm (GA) is refined by a series of PW91 geometry optimizations until a set of converged structures are obtained. The vibrational frequencies of these structures are computed and used to compute the Gibbs free energy of formation, which is in turn used to compute the equilibrium concentrations of the clusters under ambient conditions. Please click here to view a larger version of this figure.
Figure 2: Representative directory structure for each cluster. The in-house scripts included in this protocol require the directory structure shown above, where n is the number of water molecules. For each n in gly-h2o-n, there are the following subdirectories: GA for genetic algorithm with a GA/pm7 directory, QM for quantum mechanics with QM/pw91-sb for PW91/6-31+G*, QM/pw91-lb for PW91/6-311++G**, and QM/pw91-lb/ultrafine for optimizations and final vibrational calculations on ultrafine integration grids. Please click here to view a larger version of this figure.
Figure 3: Representative low energy structures of Gly(H2O)n=0-5. These clusters were the electronic energy global minima optimized at the PW91/6-311++G** level of theory. Please click here to view a larger version of this figure.
Figure 4: Temperature dependence of Gly(H2O)n=0-5 as 100% relative humidity and 1 atm pressure. The concentration of the hydrates is given in units of molecules cm-3. Please click here to view a larger version of this figure.
Figure 5: Relative humidity dependence of Gly(H2O)n=0-5 as 298.15 K and 1 atm pressure. The concentration of the hydrates is given in units of molecules cm-3. Please click here to view a larger version of this figure.
E[PW91/6-311++G**] | 216.65 K | 273.15 K | 298.15 K | ||||||||
LB-UF | ZPVE | ∆H | S | ∆G | ∆H | S | ∆G | ∆H | S | ∆G | |
water | -76.430500 | 13.04 | 1.72 | 42.59 | 5.54 | 2.17 | 44.44 | 3.08 | 2.37 | 45.14 | 1.96 |
glycine | -284.434838 | 48.55 | 2.65 | 69.53 | 36.14 | 3.70 | 73.81 | 32.09 | 4.22 | 75.61 | 30.22 |
Table 1: Monomer energies. Electronic energies are in units of Hartree while all other quantities are in units of kcal mol-1. Water and glycine were optimized at the PW91/6-311++G** level of theory and vibrational frequencies were computed. The thermodynamic corrections for a pressure of 1 atm and temperature of 298.15 K were computed using the thermo.pl script.
E[PW91/6-311++G**] | 0 K | 216.65 K | 273.15 K | 298.15 K | ||||||||
n | 氏名 | LB-UF | ZPVE | ∆H | S | ∆G | ∆H | S | ∆G | ∆H | S | ∆G |
1 | gly-h2o-1 | -360.88481 | 63.96 | 3.61 | 80.12 | 50.22 | 5.12 | 86.27 | 45.52 | 5.85 | 88.83 | 43.33 |
2 | gly-h2o-2 | -437.33763 | 79.33 | 4.53 | 90.86 | 64.17 | 6.46 | 98.78 | 58.81 | 7.40 | 102.06 | 56.30 |
3 | gly-h2o-3 | -513.78620 | 94.52 | 5.67 | 105.08 | 77.42 | 8.08 | 114.94 | 71.19 | 9.23 | 119.00 | 68.27 |
4 | gly-h2o-4 | -590.23667 | 109.80 | 6.03 | 104.98 | 91.30 | 8.78 | 116.21 | 84.40 | 10.11 | 120.87 | 81.14 |
5 | gly-h2o-5 | -666.68845 | 125.80 | 7.26 | 121.70 | 106.69 | 10.47 | 134.83 | 99.44 | 12.01 | 140.24 | 96.00 |
Table 2: Cluster energies. The energies of the lowest-energy Gly(H2O)n=1-5 structures found using our procedure outlined in Figure 1. Electronic energies are in units of Hartree while all other quantities are in units of kcal mol-1.
Total Hydration: Gly + nH2O <-> Gly(H2O)n | Sequential Hydration: Gly(H2O)n-1 + H2O <-> Gly(H2O)n | ||||||||||||||||
E[PW91/6-311++G**] | 216.65 | 273.15 | 298.15 | 216.65 | 273.15 | 298.15 | |||||||||||
n | system name | LB-UF | ∆E(0) | ∆H(T) | ∆G(T) | ∆H(T) | ∆G(T) | ∆H(T) | ∆G(T) | LB-UF | ∆E(0) | ∆H(T) | ∆G(T) | H(T) | ∆G(T) | ∆H(T) | ∆G(T) |
1 | gly-h2o-1 | -12.22 | -9.85 | -10.61 | -3.68 | -10.61 | -1.87 | -10.59 | -1.07 | -12.22 | -9.85 | -10.61 | -3.68 | -10.61 | -1.87 | -10.59 | -1.07 |
2 | gly-h2o-2 | -26.22 | -21.53 | -23.10 | -9.27 | -23.11 | -5.66 | -23.09 | -4.06 | -14.00 | -11.68 | -12.49 | -5.59 | -12.50 | -3.79 | -12.50 | -2.99 |
3 | gly-h2o-3 | -37.56 | -30.72 | -32.88 | -12.90 | -32.87 | -7.69 | -32.82 | -5.38 | -11.34 | -9.19 | -9.78 | -3.63 | -9.76 | -2.03 | -9.73 | -1.32 |
4 | gly-h2o-4 | -50.10 | -40.34 | -43.48 | -15.87 | -43.54 | -8.71 | -43.51 | -5.55 | -12.54 | -9.62 | -10.60 | -2.97 | -10.67 | -1.02 | -10.69 | -0.17 |
5 | gly-h2o-5 | -63.45 | -51.41 | -55.42 | -20.58 | -55.51 | -11.48 | -55.48 | -7.45 | -13.35 | -11.07 | -11.94 | -4.71 | -11.97 | -2.77 | -11.97 | -1.90 |
Table 3: Hydration energies. The total energy of hydration and energy of sequential hydration for Gly(H2O)n=1-5 in units of kcal mol-1. Here, E[PW91/6-311++G**] is the change in the electronic energy, ∆E(0) is the zero-point vibrational energy (ZPVE) corrected change in energy, ∆H(T) is the enthalpy change at temperature T, and ∆G(T) is the Gibbs free energy change of hydration of each Gly(H2O)n=1-5 cluster.
Equilibrium Hydrate Distribution as a function of temperature and relative humidity | |||||||||
T=298.15K | T=273.15K | T=216.65K | |||||||
Gly(H2O)n | RH=100% | RH=50% | RH=20% | RH=100% | RH=50% | RH=20% | RH=100% | RH=50% | RH=20% |
0 | 1.3E+06 | 2.2E+06 | 2.7E+06 | 1.1E+06 | 2.0E+06 | 2.7E+06 | 6.1E+05 | 1.5E+06 | 2.5E+06 |
1 | 2.3E+05 | 1.9E+05 | 9.5E+04 | 2.0E+05 | 1.9E+05 | 9.9E+04 | 1.2E+05 | 1.5E+05 | 9.5E+04 |
2 | 1.0E+06 | 4.3E+05 | 8.4E+04 | 1.3E+06 | 6.1E+05 | 1.3E+05 | 1.8E+06 | 1.1E+06 | 3.0E+05 |
3 | 2.8E+05 | 5.8E+04 | 4.5E+03 | 3.2E+05 | 7.4E+04 | 6.3E+03 | 3.1E+05 | 9.6E+04 | 1.0E+04 |
4 | 1.1E+04 | 1.1E+03 | 3.4E+01 | 1.3E+04 | 1.5E+03 | 5.0E+01 | 1.1E+04 | 1.8E+03 | 7.5E+01 |
5 | 7.5E+03 | 3.9E+02 | 4.9E+00 | 1.2E+04 | 7.2E+02 | 9.7E+00 | 2.4E+04 | 1.9E+03 | 3.1E+01 |
Table 4: Equilibrium hydrate concentrations of Gly(H2O)n=0-5 as a function temperature (T=298.15K, 273.15K, 216.65K) and relative humidity (RH=100%, 50%, 20%). The concentration of the hydrates is given in units of molecules cm-3 assuming experimental values56,57,58, of [Gly]0 = 2.9 x 106 cm-3 and [H2O] = 7.7 x 1017 cm-3, 1.6 x 1017 cm-3 and 9.9 x 1014 cm-3 at 100% relative humidity and T = 298.15 K, 273.15 K, and 216.65 K, respectively59.
Supplemental Files. Please click here to download these files.
The accuracy of the data generated by this protocol depends mainly on three things: (i) the variety of configurations sampled by Step 2, (ii) the accuracy of the electronic structure of the system, (iii) and the accuracy of the thermodynamic corrections. Each of these factors can be addressed by modifying the method by editing the included scripts. The first factor is easily overcome with the use of a larger initial pool of randomly generated structures, more numerous iterations of the GA, and a looser definition of the criteria involved in the GA. In addition, one may use a different semi-empirical method such as the self-consistent charge density-functional tight-binding (SCC-DFTB)62 model and the effective fragment potential (EFP)63 model in order to explore the effects of different physical descriptions. The main limitation here is the inability of the method to form or break covalent bonds, meaning that the monomers are frozen. The GA procedure only finds the most stable relative positions of these frozen monomers according to the semi-empirical description.
The accuracy of the electronic structure of the system can be improved in a variety of ways, each with its computational cost. One may choose a better density functional, such as M06-2X64 and wB97X-V65, or quantum mechanical (QM) method such as the Møller-Plesset66,67,68 (MPn) perturbation theories and coupled-cluster69 (CC) methods in order to improve the physical description of the system. In the hierarchy of functionals, the performance generally improves upon going from generalized-gradient approximation (GGA) functionals like PW91 to range-separated hybrid functionals like wB97X-D and meta-GGA hybrid functionals like M06-2X.
The disadvantage of DFT methods is that a systematic convergence towards an accurate value is not possible; however, DFT methods are computationally inexpensive and there is a wide variety of functionals for a wide variety of applications.
Energies calculated using wavefunction methods like MP2 and CCSD(T) in conjunction with correlation consistent basis sets of increasing cardinal number ([aug-]cc-pV[D,T,Q,…]Z) converge towards their complete basis set limit systematically, but the computational cost of each calculation becomes prohibitive as the system size grows. Further refinement of the electronic structure can be accomplished by using explicitly correlated basis sets70 and by extrapolating to the complete basis set (CBS)71 limit. Our recent work suggests that a density-fitted explicitly correlated second-order Møller-Plesset (DF-MP2-F12) perturbative approach yields energies approaching that of MP2/CBS computations32. Modification of the current protocol to use different electronic structure methods involves two steps: (i) prepare a template input file following the syntax given by the software, (ii) and edit the run-pw91-sb.csh, run-pw91-lb.csh, and run-pw91-lb-ultrafine.csh scripts to generate the correct input file syntax as well as the correct submit script for the software.
Lastly, the accuracy of the thermodynamic corrections depends on the electronic structure method as well as the description of the PES around the global minimum. An accurate description of the PES requires the computation of third- and higher-order derivatives of the PES with respect to displacements in the nuclear degrees of freedom, such as the quartic force field72,73 (QFF), which is an exceptionally costly task. The current protocol uses the harmonic oscillator approximation to the vibrational frequencies, resulting in the need to compute only up to second derivatives of the PES. This approach becomes problematic in systems with high anharmonicity, such as very floppy molecules and symmetric double-well potentials due to the large difference in the true PES and the harmonic PES. Furthermore, the cost of having a high-quality PES from a computationally demanding electronic structure method only compounds the problem of cost for vibrational frequency calculations. One approach to overcome this is to use the electronic energies from a high-quality electronic structure calculation along with vibrational frequencies computed on a lower quality PES, resulting in a balance between cost and accuracy. The current protocol can be modified to use different PES descriptions as described in the previous paragraph; however, one may also edit the vibrational frequency keywords in the scripts and templates to compute anharmonic vibrational frequencies.
Two crucial issues for any configurational sampling protocol are the initial method for sampling the potential energy surface and the criteria used to identify each cluster. We have made extensive use of a variety of methods in our previous work. For the first issue, the initial method for sampling the potential energy surface, we have made the choice of using GA with semi-empirical methods based on these factors. Configurational sampling using chemical intuition26, random sampling, and molecular dynamics (MD)29,30, fail to find putative global minima regularly for clusters larger than 10 monomers, as we observed in our studies of water clusters18. We have successfully used basin hopping (BH) to study the complex PES of (H2O)1174, but it required the manual inclusion of some potential low energy isomers the BH algorithm did not find. A comparison of the performance of BH and GA in finding the global minimum of water clusters, (H2O)n=10-20 demonstrated that GA consistently found the global minimum faster than BH75. GA as implemented in OGOLEM and CLUSTER is very versatile because it can be applied to any molecular cluster and it can interface with a vast number of packages with classical force field, semi-empirical, density functional, and ab initio capabilities. The choice of PM7 is driven by its speed and reasonable accuracy. Virtually any other semi-empirical method would have significantly higher computational cost.
As for the second issue, we have explored using different criteria to identify unique structures ranging from electronic energies, dipole moments, overlap RMSDs and rotational constants. Using dipole moments proved difficult because both the dipole moment components were dependent on the molecule's orientation and the total dipole moment was very sensitive to geometry differences in such a way that it was difficult to set thresholds determining is structures are the same or unique. A combination of electronic energies and rotational constants proved to be most useful.
The current criteria for deeming two structures unique is based on an energy difference threshold of 0.10 kcal mol-1 and rotational constant difference of 1%. Therefore, two structures are considered different if their energies differ by more than 0.10 kcal mol-1 (~0.00015 a.u.) AND any of their three rotational constants (A, B, C) differ by more than 1%. Substantial internal benchmarks over the years found these thresholds to be reasonable choices. Our configurational sampling approach and screening methodology has been applied to very weakly bound clusters such as polyaromatic hydrocarbons complexed with water76,77 as well as strongly bound ternary sulfate hydrates containing ammonia and amines32. For clusters where there are different protonation states to be considered, the best approach is to run various GA calculations, each starting with monomers in different protonation states. This ensures that structures with different protonation states are carefully considered. However, the low-level DFT calculations often allow protonation states to change during the course of the geometry optimization, thereby yielding the most stable protonation state regardless of the starting geometry.
Our GA configurational sampling methods should work well even for floppy molecules as long as the GA codes are interfaced with general, non-parameterized methods that allow the monomers to adopt different configurations during the course of the GA run. For example, interfacing GA with PM7 would allow monomers' structures to change, but if their bonds break as would happen when protonation states change, the structures may get discarded as unacceptable candidates.
We have considered different ways of correcting the shortcomings of the harmonic approximation, especially those arising from low vibrational frequencies. Incorporating the quasi-harmonic approximation into the current methodology is not difficult. However, there are still questions about the quasi-harmonic method, especially when it comes to the cutoff frequency below which it will be applied. Also, there are no rigorous benchmarking works examining the reliability of the quasi-RRHO approximation even though conventional wisdom suggests it should be an improvement over RRHO approximation.
The protocol thus presented may be generalized to any system of noncovalently-bound gas phase molecular clusters. It may also be generalized to use any semi-empirical method, electronic structure method and software, and vibrational analysis method and software by editing the scripts and templates. This assumes that the user is comfortable with the Linux command-line interface, Python scripting, and high-performance computing. The unfamiliar syntax and look of the Linux operating system and lack of scripting experience is the largest pitfall in this protocol and is where new students struggle the most. This protocol has been used successfully in a variety of implementations for years in our group, mostly focusing on the effects of sulfuric acid and ammonia on aerosol formation. Further improvements to this protocol will involve a more robust interface to more electronic structure software, alternative implementations of the genetic algorithm, and possibly the use of newer methods for faster computations of electronic and vibrational energies. Our current applications of this protocol are exploring the importance of amino acids in the early stages of aerosol formation in the current atmosphere and in the formation of larger biological molecules in prebiotic environments.
The authors have nothing to disclose.
This project was supported by grants CHE-1229354, CHE-1662030, CHE-1721511, and CHE-1903871 from the National Science Foundation (GCS), the Arnold and Mabel Beckman Foundation Beckman Scholar Award (AGG), and the Barry M. Goldwater Scholarship (AGG). High-performance computing resources of the MERCURY Consortium (http://www.mercuryconsortium.org) were used.
Avogadro | https://avogadro.cc | Open-source molecular visualization program | |
Gaussian [09/16] Software | http://www.gaussian.com/ | Commercial ab initio electronic structure program | |
MOPAC 2016 | http://openmopac.net/MOPAC2016.html | Open-source semi-empirical program | |
OGOLEM Software | https://www.ogolem.org | Genetic algorithm-based global optimization program | |
OpenBabel | http://openbabel.org/wiki/Main_Page | Open-source cheminformatics library | |
calcRotConsts.py | Shields Group, Department of Chemistry, Furman University | Python script to compute rotational constants | |
calcSymmetry.csh | Shields Group, Department of Chemistry, Furman University | Shell script to calculate symmetry number of a molecule given Cartesian coordinates | |
combine-GA.csh | Shields Group, Department of Chemistry, Furman University | Shell script to combine energy and rotational constants from different GA directories | |
combine-QM.csh | Shields Group, Department of Chemistry, Furman University | Shell script to combine energy and rotational constants from different QM directories | |
gaussianE.csh | Shields Group, Department of Chemistry, Furman University | Shell script to extract Gaussian 09 energies | |
gaussianFreqs.csh | Shields Group, Department of Chemistry, Furman University | Shell script to extract Gaussian 09 vibrational frequencies | |
getrotconsts | Shields Group, Department of Chemistry, Furman University | Executable to calculate rotational constants given a molecule's Cartesian coordinates | |
getRotConsts-dft-lb.csh | Shields Group, Department of Chemistry, Furman University | Shell script to compute rotational constants for a batch of large basis DFT optimized structures | |
getRotConsts-dft-lb-ultrafine.csh | Shields Group, Department of Chemistry, Furman University | Shell script to compute rotational constants for a batch of ultrafine DFT optimized structures | |
getRotConsts-dft-sb.csh | Shields Group, Department of Chemistry, Furman University | Shell script to compute rotational constants for a batch of small basis DFT optimized structures | |
getRotConsts-GA.csh | Shields Group, Department of Chemistry, Furman University | Shell script to compute rotational constants for a batch of genetic algorithm optimized structures | |
global-minimum-coords.xyz | Shields Group, Department of Chemistry, Furman University | Cartesian coordinates of global minimum structures of gly-(h2o)n, where n=0-5 | |
make-thermo-gaussian.csh | Shields Group, Department of Chemistry, Furman University | Shell script to extract data from Gaussian output files and make input files for the thermo.pl script | |
ogolem-input-file.ogo | Shields Group, Department of Chemistry, Furman University | Ogolem sample input file | |
ogolem-submit-script.pbs | Shields Group, Department of Chemistry, Furman University | PBS batch submission file for Ogolem calculations | |
README.docx | Shields Group, Department of Chemistry, Furman University | Clarifications to help readers use the scripts effectively | |
runogolem.csh | Shields Group, Department of Chemistry, Furman University | Shell script to run OGOLEM | |
run-pw91-lb.csh | Shields Group, Department of Chemistry, Furman University | Shell script to run a batch of large basis DFT optimization calculations | |
run-pw91-lb-ultrafine.csh | Shields Group, Department of Chemistry, Furman University | Shell script to run a batch of ultrafine DFT optimization calculations | |
run-pw91-sb.csh | Shields Group, Department of Chemistry, Furman University | Shell script to run a batch of small basis DFT optimization calculations | |
run-thermo-pw91.csh | Shields Group, Department of Chemistry, Furman University | Shell script to compute the thermodynamic corrections for a batch of DFT optimized structures | |
similarityAnalysis.py | Shields Group, Department of Chemistry, Furman University | Python script to determine unique structures based on rotational constants and energies | |
symmetry | Shields Group, Department of Chemistry, Furman University | Executable to calculate molecular symmetry given Cartesian coordinates | |
symmetry.c | (C) 1996, 2003 S. Patchkovskii, Serguei.Patchkovskii@sympatico.ca | C code to determine the molecular symmstry of a molecule given Cartesian coordinates | |
template-marcy.pbs | Shields Group, Department of Chemistry, Furman University | Template for a PBS submit script which uses OGOLEM | |
template-pw91.com | Shields Group, Department of Chemistry, Furman University | Template Gaussian 09 input | |
template-pw91-HL.com | Shields Group, Department of Chemistry, Furman University | Template Gaussian 09 input for ultrafine DFT optimization | |
thermo.pl | https://www.nist.gov/mml/csd/chemical-informatics-research-group/products-and-services/program-computing-ideal-gas | Perl open-source script to compute ideal gas thermodynamic corrections | |
gly-h2o-n.xlsx | Shields Group, Department of Chemistry, Furman University | Excel spreadsheet for the complete protocol | |
table-1.xlsx | Shields Group, Department of Chemistry, Furman University | Excel spreadsheet | |
table-2.xlsx | Shields Group, Department of Chemistry, Furman University | Excel spreadsheet | |
table-3.xlsx | Shields Group, Department of Chemistry, Furman University | Excel spreadsheet | |
table-4.xlsx | Shields Group, Department of Chemistry, Furman University | Excel spreadsheet | |
water.xyz | Shields Group, Department of Chemistry, Furman University | Cartesian coordinates of water | |
glycine.xyz | Shields Group, Department of Chemistry, Furman University | Cartesian coordinates of glycine |