Accelerating Pythonic Coupled-Cluster Implementations: A Comparison Between CPUs and GPUs

In this work, we benchmark several Python routines with respect to their time and memory requirements to identify the optimal choice among the available tensor contraction operations. We scrutinize how to accelerate the bottleneck tensor operations of Pythonic coupled-cluster implementations in the Cholesky linear algebra domain, utilizing an NVIDIA Tesla V100S PCIe 32GB (rev 1a) graphics processing unit (GPU). The NVIDIA Compute Unified Device Architecture (CUDA) API is accessed through CuPy, an open-source library for Python designed as a NumPy drop-in replacement for GPUs. Due to the limitations of video memory, the GPU calculations must be performed batch-wise. Timing results of some contractions containing large tensors are presented. The CuPy implementation leads to a factor of 10-16 speed-up of the bottleneck tensor contractions compared to computations on 36 central processing unit (CPU) cores. Finally, we compare example CCSD and pCCD-LCCSD calculations performed solely on CPUs to their CPU-GPU hybrid implementation, which leads to a speed-up of a factor of 3-4 compared to the CPU-only variant.


Introduction
A critical job of a graphics card is to compute projections of three-dimensional objects onto a 2D surface using linear algebra. These calculations can be performed in parallel very effectively, meaning that multiple small mathematical operations like multiplications and additions can be performed simultaneously. For this reason, graphics processing unit (GPU) development mainly focuses on massively increasing the number of computing cores. While a central processing unit (CPU) may have up to 128 cores on the high end, GPUs already have up to 16,000 cores (like the NVIDIA RTX 4090).
The first to report independently that the internal hugely parallel structure of GPUs can be repurposed, not to compute graphics, but to be utilized in quantum chemistry were Yasuda 11 and Ufimtsev and Martinez, 12 respectively. Today, the potential of GPUs for nongraphics-related computations is widely understood and often used to accelerate quantum chemistry methods like density functional approximations, 13 Hartree-Fock theory, 14,15 Møller-Plesset perturbation theory, [16][17][18] coupled cluster theory, 3,8,[19][20][21] and the evaluation of effective core potentials, 22 to name a few. The NVIDIA CUDA API offers a C++ interface to utilize GPU computing power. 23,24 A relatively quick way to interact with this API is, for example, CuPy, 25 a Python library which internally uses CUDA routines. CuPy attracts a lot of attention from Python-based software developers as its interface is highly compatible with NumPy 26 and even allows for drop-in replacement in particular cases. Although some fine-grained control is sacrificed when exploiting libraries like CuPy, such third-party libraries are quickly accessible without the necessity of in-detail back-end control and are, therefore, a very convenient and efficient way to probe graphics processor utilization in the first place.
In this work, we will present benchmark results comparing the timings of the bottleneck tensor contractions present in coupled-cluster calculations restricted to at most double excitations, where Cholesky vectors approximate the electron repulsion integrals. 27 Specifically, we propose several flavors of performing these bottleneck contractions on a CPU using Pythonic libraries and benchmark their resource requirements. These contractions are calculated with, for instance, NumPy's 26 tensordot and einsum routines or opt_einsum 28 on CPUs. Finally, we elaborate on exporting these operations to a GPU exploiting CuPy's tensordot routine. 25 This work is organized as follows. In section 2, we briefly discuss the main bottleneck operations in coupled-cluster calculations. Section 3 scrutinizes several Python-based strategies to compute the coupled-cluster vector function. In section 4, we examine the PyBEST tensor contraction engine. A GPU implementation exploiting the CuPy library is summarized in section 5. Numerical results and an assessment of the GPU to CPU performance are presented in section 6. Finally, we conclude in section 7.

The CC ansatz
Our starting point is the coupled cluster (CC) ansatz, [29][30][31][32][33][34] |Ψ⟩ = e^T |Φ₀⟩, where T is the cluster operator and |Φ₀⟩ some reference wave function like the Hartree-Fock determinant. In this work, we will consider, at most, double excitations in the cluster operator, that is, T = T2. We do not consider single excitations explicitly as the bottleneck operations are due to the T2 excitation operator. Furthermore, we will work in the spin-free representation, with spin-free amplitudes, and the CC equations are spin-summed. In this picture, the double excitation operator takes on the form

T2 = 1/2 Σ_ij^occ Σ_ab^virt t_ij^ab Ê_ai Ê_bj,

with the CCD amplitudes t_ij^ab and Ê_ai = a†i + ā†ī being the singlet excitation operator, where p† (p̄†) indicates electron creation operators for α (β) electrons, while p (p̄) are the corresponding annihilation operators. The above sum runs over all occupied (occ) and virtual (virt) orbitals for the chosen reference determinant |Φ₀⟩. The scaling-determining step in the CCD amplitude equations is associated with the following term,

0 = … + Σ_cd ⟨ab|cd⟩ t_ij^cd + …
Summation over the indices i, j, a, b is implied, which results in the formal scaling of the CCD equations of O(o²v⁴). In the above equation, ⟨ab|cd⟩ are the electron-repulsion integrals in Physicist's notation. To reduce the storage of the full electron-repulsion integrals, they can be approximated by a Cholesky decomposition, 27,35

⟨ab|cd⟩ ≈ Σ_x L^x_ac L^x_bd, (5)

where x indicates the summation over the elements of the Cholesky vectors. Since we work with real orbitals with 8-fold permutational symmetry of the electron-repulsion integrals (ERI), both Cholesky vectors L^x_ac and L^x_bd are identical. Then, the CC excitation amplitudes are optimized iteratively by rewriting the rate-determining step of the vector function t̃_ij^ab of iteration n of the optimization procedure to include a third summation index running over the Cholesky vectors (ignoring the summation over the fixed indices i, j, a, b),

t̃_ij^ab = … + Σ_x Σ_cd L^x_ac L^x_bd t_ij^cd + … (6)

Thus, formally, the complexity increases to O(xo²v⁴). Note that the dimension of x depends on the chosen threshold of the Cholesky decomposition. For decent to tight thresholds (around 10⁻⁶), we have x ≈ 5(o + v).
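The factorization in eq. (5) can be illustrated on toy data (a sketch, not taken from any quantum chemistry package; all names and dimensions are hypothetical): a symmetric positive definite stand-in for the ERI matrix V[(ac),(bd)] admits a Cholesky factorization V = L Lᵀ, and reshaping the factor yields vectors L[x, a, c] that reassemble the integrals.

```python
import numpy as np

# Toy illustration of eq. (5): build a symmetric positive definite stand-in
# for the ERI matrix V[(ac),(bd)], Cholesky-factorize it, and verify
#   <ab|cd> = V[(ac),(bd)] = sum_x L[x, a, c] * L[x, b, d].
rng = np.random.default_rng(0)
n = 4                                    # toy number of orbitals
A = rng.standard_normal((n * n, n * n))
V = A @ A.T + n * np.eye(n * n)          # symmetric positive definite
L = np.linalg.cholesky(V)                # V = L @ L.T
L_x = L.T.reshape(-1, n, n)              # Cholesky "vectors" L[x, a, c]
V_rebuilt = np.einsum("xac,xbd->acbd", L_x, L_x).reshape(n * n, n * n)
assert np.allclose(V_rebuilt, V)
```

In an actual calculation, an incomplete (pivoted) Cholesky decomposition truncates the sum over x at the chosen threshold, which is where the approximation sign in eq. (5) and the estimate x ≈ 5(o + v) come from.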

Pythonic implementations of the CC vector function
To solve for the CC amplitudes iteratively, we must evaluate the vector function of eq. (6) in each iteration, including all its terms. Thus, we will focus on the bottleneck operations of the corresponding CCD vector function. [37][38][39][40][41] Based on the chosen routines, these summations (in the following called contractions) can be performed in one shot or sequentially, creating several intermediates to boost efficiency and reduce resource requirements or even enabling out-of-the-box parallelization. In the following, we will scrutinize different variants to evaluate eq. (6) as pythonically as possible, focusing solely on Python features and libraries. Specifically, we focus on the NumPy routines einsum and tensordot, the opt_einsum package, and a GPU implementation exploiting CuPy's tensordot feature.

einsum and opt_einsum
The possibly easiest and most straightforward way of avoiding nested for-loop implementations when dealing with tensor contractions is to refer to NumPy's einsum routine, which evaluates the Einstein summation convention on a sequence of operands. Using numpy.einsum, many-albeit not all-linear algebraic operations on multidimensional arrays can be represented in a simple and intuitive language. With increasing version number, additional features and improvements have been incorporated into the numpy.einsum function. One crucial parameter is the optimize argument, which allows control over intermediate optimization. If set to "optimal", an optimal path for the contraction in question will be determined. Another possibility is to exploit the numpy.einsum_path function to steer the order of the individual contractions in the most optimal way. An effort to improve the performance of the original numpy.einsum routine led to the development of the opt_einsum package. 28 It offers several features that optimize numpy.einsum significantly, reducing the overall execution time of einsum-like expressions. For instance, it automatically optimizes the order of the underlying contraction and exploits specialized routines or BLAS. 42 opt_einsum can also handle various array types, like NumPy, Dask, PyTorch, TensorFlow, or CuPy arrays, to name a few. Furthermore, the optimization of numpy.einsum has been passed upstream to the original numpy.einsum project.
Some of opt_einsum's features can hence be utilized by numpy.einsum by modifying the optimization option. Since opt_einsum features more up-to-date algorithms for complex contractions, we will focus on the opt_einsum.contract function to evaluate the Einstein summation convention on a sequence of operands, typically containing three multi-dimensional input arrays.
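To make the optimize machinery concrete, the following toy sketch (hypothetical dimensions, not from the discussed implementation) queries numpy.einsum_path for the three-operand bottleneck subscript and checks that the optimized evaluation agrees with the unoptimized one:

```python
import numpy as np

# Toy dimensions for the subscript "xac,xbd,ecfd->eafb"
o, v, x = 3, 6, 20
rng = np.random.default_rng(1)
L_0 = rng.standard_normal((x, v, v))
L_1 = rng.standard_normal((x, v, v))
t_old = rng.standard_normal((o, v, o, v))

# einsum_path reports the pairwise contraction order and the predicted cost
path, info = np.einsum_path("xac,xbd,ecfd->eafb", L_0, L_1, t_old,
                            optimize="optimal")

# the optimized path must reproduce the naive (unoptimized) result
ref = np.einsum("xac,xbd,ecfd->eafb", L_0, L_1, t_old, optimize=False)
opt = np.einsum("xac,xbd,ecfd->eafb", L_0, L_1, t_old, optimize="optimal")
assert np.allclose(ref, opt)
```

Printing `info` shows the chosen pairwise order and the ratio between the naive and optimized floating-point operation counts, which is what the optimize="optimal" setting exploits internally.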
opt_einsum.contract represents a replacement for numpy.einsum where the order of the contraction is optimized to reduce the overall scaling (and hence increase the computational speed-up) at the cost of several intermediate arrays. To steer the memory limit and prevent the generation of too large intermediates, opt_einsum.contract offers the memory_limit parameter to provide an upper bound on the largest intermediate array built during the tensor contraction.
Thus, the bottleneck contraction in eq. (6) can be straightforwardly evaluated using, for instance, opt_einsum.contract as follows

    t_new = opt_einsum.contract("xac,xbd,ecfd->eafb", L_0, L_1, t_old)

In the above code snippet, t_new indicates the vector function of iteration n, t_old the current approximate solution of the CCD amplitudes, and L_0 (L_1) is the Cholesky vector of eq. (5). Note that the t̃_ij^ab are stored as a 4-dimensional NumPy array t_new[i,a,j,b].
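As a toy-dimension sanity check (not part of the discussed implementation), the subscript reproduces the nested summation it encodes; numpy.einsum is used here so the check runs without opt_einsum installed:

```python
import numpy as np

# Tiny toy dimensions so the explicit nested loops stay cheap
o, v, x = 2, 3, 5
rng = np.random.default_rng(2)
L_0 = rng.standard_normal((x, v, v))
L_1 = rng.standard_normal((x, v, v))
t_old = rng.standard_normal((o, v, o, v))

t_new = np.einsum("xac,xbd,ecfd->eafb", L_0, L_1, t_old, optimize="optimal")

# explicit definition: t_new[i,a,j,b] = sum_{x,c,d} L_0[x,a,c] L_1[x,b,d] t_old[i,c,j,d]
ref = np.zeros((o, v, o, v))
for i in range(o):
    for a in range(v):
        for j in range(o):
            for b in range(v):
                for c in range(v):
                    for d in range(v):
                        ref[i, a, j, b] += (L_0[:, a, c] * L_1[:, b, d]).sum() \
                                           * t_old[i, c, j, d]
assert np.allclose(t_new, ref)
```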

tensordot
An alternative routine to perform a tensor contraction is the tensordot function offered in NumPy 43 and CuPy. 25 It efficiently computes the summation over one (or more) given index (indices). It allows for saving memory by de-allocating intermediate arrays and explicitly defining the path of the complete tensor contraction. Furthermore, tensordot makes use of the BLAS 42 API and features a multithreaded implementation when linked against proper libraries like OpenBLAS, 44 MKL, 45 or ATLAS. 46 tensordot can contract along one axis of two arrays A and B and, similarly, contract (that is, sum over) two or more axes in one function call. Note, however, that tensordot allows for contracting only two operands at a time. Thus, to evaluate the term in eq. (6), a sequence of tensordot calls must be performed where suitable intermediates are created. One possibility is to contract the Cholesky vectors to create an intermediate of dimension v⁴, which is then passed to a second tensordot call generating the desired output,

    # first intermediate generates the dense ERI [a,c,b,d]
    eri = numpy.tensordot(L_0, L_1, axes=(0, 0))
    t_new = numpy.tensordot(eri, t_old, axes=([1, 3], [1, 3])).transpose(2, 0, 3, 1)

Note that tensordot does not reorder the axes. In that case, we need to transpose the intermediate result to match the shape of the output array (the vector function). However, generating a v⁴ intermediate of the ERI might be prohibitive regarding memory requirements for larger systems. Other possibilities lead to even larger intermediates. For instance, contracting L_0 with t_old yields a multidimensional array of size xo²v², where x ≈ a(o + v) and a is a pre-factor depending on the threshold of the Cholesky-decomposed ERI. This pre-factor is typically challenging to determine a priori. Nonetheless, for computationally feasible problems, the condition ao² ≪ v is rarely satisfied, making the first contraction path computationally more efficient in terms of memory. A less elegant, albeit computationally cheaper way to use all the benefits of the tensordot function is to introduce one
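The axes conventions used above can be checked on toy arrays (hypothetical shapes); note how tensordot keeps the free axes of the first operand before those of the second, which is why the final transpose is needed:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 3, 4))
B = rng.standard_normal((5, 6))

# axes=(0, 0): sum over one axis pair; free axes keep their order -> (3, 4, 6)
one_axis = np.tensordot(A, B, axes=(0, 0))
assert one_axis.shape == (3, 4, 6)
assert np.allclose(one_axis, np.einsum("xij,xk->ijk", A, B))

# axes=([1, 3], [1, 3]): sum over two axis pairs, as in the ERI contraction
eri = rng.standard_normal((3, 4, 3, 4))   # plays the role of the [a, c, b, d] intermediate
t = rng.standard_normal((2, 4, 2, 4))     # plays the role of the amplitudes [e, c, f, d]
res = np.tensordot(eri, t, axes=([1, 3], [1, 3]))
assert res.shape == (3, 3, 2, 2)          # free axes in order (a, b, e, f), not (e, a, f, b)
assert np.allclose(res, np.einsum("acbd,ecfd->abef", eri, t))
```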
for-loop to iterate over one axis. If we choose to loop over the second axis of L_0, we generate intermediates of at most v³. The following will refer to this contraction path as our numpy.tensordot routine. Note, however, that this is not purely a numpy.tensordot computation, but an iterative call of the numpy.tensordot method to calculate eq. (6) that prevents the creation of v⁴ intermediate tensors.
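A self-contained toy version of this loop-based scheme (hypothetical dimensions; variable names follow the text) can be written and verified against einsum:

```python
import numpy as np

o, v, x = 2, 4, 6
rng = np.random.default_rng(4)
L_0 = rng.standard_normal((x, v, v))
L_1 = rng.standard_normal((x, v, v))
t_old = rng.standard_normal((o, v, o, v))

t_new = np.zeros((o, v, o, v))
for a in range(v):
    # v^3 intermediate: dense ERI slice [c, b, d] for fixed virtual index a
    eri_a = np.tensordot(L_0[:, a, :], L_1, axes=(0, 0))
    # contract over c and d with the amplitudes -> (b, e, f)
    tmp = np.tensordot(eri_a, t_old, axes=([0, 2], [1, 3]))
    # reorder (b, e, f) -> (e, f, b) and store the slice of the vector function
    t_new[:, a, :, :] = tmp.transpose(1, 2, 0)

assert np.allclose(t_new, np.einsum("xac,xbd,ecfd->eafb", L_0, L_1, t_old))
```

The largest intermediate, eri_a, holds v³ elements instead of v⁴, which is the memory saving the loop buys at the cost of v separate tensordot calls.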

A modular implementation of tensor contractions
We have implemented and benchmarked the performance of the above-mentioned tensor contraction routines in PyBEST. 37 Specifically, PyBEST is designed as a modular toolbox where the wavefunction-specific implementations are decoupled from the linear algebra operations. In an actual calculation, the logic in choosing the optimal tensor contraction scheme is as follows

    try:
        cupy-accelerated contraction
    except NotImplementedError:
        try:
            numpy.tensordot contraction
        except ArgumentError:
            opt_einsum; if not available, numpy.einsum(..., optimize="optimal")

The first try statement enforces that selected tensor contractions are performed on the GPU if available. If bottleneck-specific contractions are not supported, or a CUDA-ready GPU is unavailable, a numpy.tensordot call is performed. Since numpy.tensordot supports only specific contractions featuring non-repeating indices (that is, repeated indices have to be summed over in numpy.tensordot), an opt_einsum call is performed or, if opt_einsum is not available, an optimized numpy.einsum function call is made instead. The corresponding tensor contraction operation is written using the Einstein-summation convention of the numpy.einsum module, that is, all mathematical operations use an input and output string, where repeated indices are summed over. That allows for one unique notation of mathematical operations independent of the underlying representation of the tensors, especially the ERI. As an example, the bottleneck contraction in eq. (6) translates to the string "abcd,ecfd->eafb", where the first part (abcd) corresponds to the ERI, the second part (ecfd) to the doubles amplitudes, while the output string (eafb) indicates the order of the output indices of the vector function. The subscripts argument is a string specifying the contraction using the numpy.einsum notation, while all multi-dimensional input arrays are stored in the operands argument. We assume that the ERI corresponds to the leading subscripts. If a tensor contraction cannot be performed, an ArgumentError is raised and we transition to opt_einsum or numpy.einsum(..., optimize="optimal"). In case of too large intermediates, the tensordot_helper function is replaced by a function call containing selected hand-optimized tensor contraction operations. The implemented flow of contraction operations (CuPy - numpy.tensordot - opt_einsum/numpy.einsum) allows for an optimal usage of computational resources, hardware, and multithreaded implementations. The reasons for the proposed operational flow are scrutinized in section 6. When
the performance of the used libraries is improved in future releases, the order of the contraction flavors can be adjusted to maximize efficiency without significantly interfering with the underlying source code in each wave function module.
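A minimal, self-contained sketch of this fallback logic (all names hypothetical; PyBEST's actual engine is more elaborate) could look as follows:

```python
import numpy as np

def contract(subscripts, *operands):
    # 1) GPU route: any failure (no CuPy installed, no CUDA device) falls through
    try:
        import cupy
        ops = [cupy.asarray(op) for op in operands]
        return cupy.asnumpy(cupy.einsum(subscripts, *ops))
    except Exception:
        pass
    # 2) tensordot-based route (placeholder here; a real implementation maps
    #    the subscript string to a sequence of tensordot calls)
    try:
        return _tensordot_route(subscripts, *operands)
    except NotImplementedError:
        pass
    # 3) opt_einsum if installed, optimized numpy.einsum otherwise
    try:
        import opt_einsum
        return opt_einsum.contract(subscripts, *operands)
    except ImportError:
        return np.einsum(subscripts, *operands, optimize="optimal")

def _tensordot_route(subscripts, *operands):
    raise NotImplementedError  # hypothetical hook for the tensordot path

rng = np.random.default_rng(5)
a, b = rng.standard_normal((3, 4)), rng.standard_normal((4, 5))
out = contract("ij,jk->ik", a, b)
assert np.allclose(out, a @ b)
```

Whichever branch fires, the caller always passes the same subscripts/operands pair, which is what keeps the wave-function modules independent of the chosen back end.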
In the following, all benchmark calculations adhere to the contraction flow mentioned above if not stated otherwise. That is, the majority of tensor contractions are performed using the numpy.tensordot routine, while the bottleneck contractions are outsourced to the GPU.

CuPy and batch-wise computations
While, of course, the best performance would be achieved by designing a method specifically for the target hardware, we focus on accelerating the mathematical bottleneck operations by exporting the contractions in question to the GPU.
As a proof of concept, we exploit CuPy 25 for GPU-accelerated computing. Specifically, it is written as a drop-in replacement for NumPy. 43 For medium- or large-sized molecules, we have to consider the size of the multi-dimensional arrays present in eq. (6) and, therefore, the amount of data that needs to be processed and transferred to the video RAM (VRAM). The principle of memory and processor communication is shown in Fig. 1. Due to their size, the underlying multi-dimensional arrays might be too big to be transferred, processed on the GPU, and transferred back. Thus, a generic implementation, where a NumPy implementation is recycled as a CuPy implementation by replacing, for instance, numpy.einsum with cupy.einsum and taking into account host-to-device and device-to-host operations, is impracticable or even impossible.
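Since the full operands may exceed the VRAM, a batch count has to be derived from the array sizes; a toy sketch of such a helper (hypothetical function, not PyBEST's code) is:

```python
import numpy as np

def n_batches(arrays, vram_bytes):
    """Smallest number of parts so one batch of every operand fits the budget."""
    parts = 1
    while True:
        # bytes needed for one batch of each array (one extra element per
        # array accounts for uneven splits)
        need = sum(a.nbytes // parts + a.dtype.itemsize for a in arrays)
        if need <= vram_bytes:
            return parts
        parts += 1

L_0 = np.zeros((50, 40, 40))        # 640,000 bytes of float64
t_old = np.zeros((10, 40, 10, 40))  # 1,280,000 bytes of float64
parts = n_batches([L_0, t_old], vram_bytes=1_000_000)
assert parts == 2                   # the full arrays (1,920,000 bytes) exceed the budget
```

The real implementation additionally has to account for the intermediates created on the device, not only for the input operands, so the budget passed in would be a fraction of the physical VRAM.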
To calculate contractions on the GPU for realistic molecules and large basis sets, it is necessary to do the computations in batches. In our batch-wise computing approach, the arrays on which the operations will be performed must be copied to the VRAM in chunks. To maximize performance and utilize the massive number of CUDA cores, the size of the chunks has to be chosen as large as possible so that as few cycles as possible are needed. To achieve reasonably-sized chunks of data to be processed in batches, we multiply the number of elements by their corresponding element size (in bytes) and sum up the required storage space for each multi-dimensional array involved. If too large, the arrays are split, and the necessary memory is checked again. This process is repeated until the split arrays fit into the video memory. An example that shows the splitting, allocations, and de-allocations is given in the following code example for the contraction xac,xbd,ecfd->eafb (the listing continues below Figure 1):

    t_new = numpy.zeros((t_old.shape[0], L_0.shape[1],
                         t_old.shape[2], L_1.shape[1]))
    start_e = 0
    end_e = 0
    # parts_d is the number of parts in which the dense array will be split
    for e in range(0, parts_d):
        end_e += dense_e_chunk_lengths[e]
        start_a = 0
        end_a = 0
        # parts_c is the number of parts in which the Cholesky array will be split
        for a in range(0, parts_c):

Numerical results

Software specifications and computational details
All calculations were performed on the CentOS 7 operating system using, if not mentioned otherwise, the CUDA compilation tools v12. The molecular structure of the L0 dye was optimized in the ORCA 5.0.3 software package [47][48][49][50] using the B3LYP 51,52 exchange-correlation functional and the cc-pVTZ basis set. 53 The resulting structural parameters are provided in section S2 of the Supplementary Information. That molecular structure is subsequently used in the orbital-optimized pair coupled cluster doubles (pCCD [54][55][56][57][58][59]) approach augmented with the linearized coupled cluster singles and doubles (pCCD-LCCSD) correction, 60 and in the conventional coupled-cluster singles and doubles (CCSD) approach, as implemented in the PyBEST software package. 37 In all PyBEST calculations for the L0 dye, we utilized the Cholesky linear algebra factory with a threshold of 10⁻⁵ for the ERI. In all benchmark calculations concerning timings and memory requirements, the Cholesky vectors are random arrays, where we assume a size of 5K³ with K corresponding to the number of basis functions. This roughly corresponds to a Cholesky cutoff threshold of 10⁻⁵ in actual molecular calculations.
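The random benchmark setup described above can be sketched as follows (toy K and hypothetical names; the actual benchmarks use far larger K):

```python
import numpy as np

# Random Cholesky vectors with 5*K^3 elements, i.e. shape (5K, K, K),
# for a (tiny) number of basis functions K.
K = 10
rng = np.random.default_rng(6)
L = rng.standard_normal((5 * K, K, K))
assert L.size == 5 * K**3
```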

Comparison between CPU and GPU-accelerated implementations
To be able to make assumptions about the benefit of offloading computations to the GPU, it is reasonable to study how effectively the different functions described in section 3 perform compared to each other. In the following, the computation times of numpy.tensordot, opt_einsum, and their CPU multicore processing behavior are investigated and compared with CuPy's tensordot. As mentioned above, a very handy way of implementing tensor contractions is numpy.einsum or opt_einsum, which feature a similar syntax. We should note that although numpy.einsum and opt_einsum have similar performance if two operands are contracted with each other, this is not the case anymore if the list of operands contains several multi-dimensional arrays. In the latter case, opt_einsum may be superior to NumPy in terms of computing time by several orders of magnitude, primarily due to the parallelization of the underlying lower-level opt_einsum routines. Thus, we only show benchmark results for opt_einsum in this work. Furthermore, we investigate three tensor contractions, namely "abcd,ecfd->eafb", "abcd,edfc->eafb", and "abcd,ecfd->efab". The first contraction ("abcd,ecfd->eafb") corresponds to the bottleneck term of eq. (6), while the second one is the associated exchange term, which reads

0 = … + Σ_cd ⟨ab|cd⟩ t_ij^dc + …

Although the exchange part of the formal bottleneck contraction is not present when working in a spin-free representation, we benchmark this contraction for reasons of completeness, in case a spin-dependent implementation is sought. The third contraction recipe, "abcd,ecfd->efab", is used in two additional terms of the CCSD vector function. These are one term involving an intermediate t_ijkl, which is an operation of O(xo⁴v²) complexity, and another one comprising an ERI block of size O(ov³), with a computational complexity of O(xo³v³). Note that exporting eq. (8) to the GPU is merely a byproduct of eq.
(9), as both contractions can be written using the same subscript. In both plots of Fig. 2, the greenish colors show tensordot and blueish colors show opt_einsum computations using 1 CPU core.
Compared to (our version of) numpy.tensordot, the internal optimization leads to a speed-up of a factor of 2, as indicated by Fig. 2(a). The reason we chose numpy.tensordot as the workhorse of all tensor contractions instead of opt_einsum (or numpy.einsum) is the severe disadvantage of opt_einsum regarding memory efficiency. This drawback becomes evident in the peak memory usage displayed in Fig. 2(b). The internal optimization and the creation of intermediate arrays (for speed-up) lead to a factor of 3 faster-growing memory consumption. That disqualifies opt_einsum as a generic option primarily because we do not want to and often cannot constrain the code to smaller problem sizes. Therefore, we employ numpy.tensordot as the default contraction flavor if NVIDIA CUDA is not available. We should note that opt_einsum features other arguments that limit the memory peak to a specific size. However, this feature comes at the cost of computing time. Specifically, a user-defined limit of the memory peak significantly deteriorates the speed of the numerical operations, which renders opt_einsum impractical for large-scale tensor contractions.
Fig. 3(a) summarizes timings for computations of the contraction "abcd,ecfd->eafb" (or "xac,xbd,ecfd->eafb" if the Cholesky vectors are explicitly mentioned) plotted over the number of CPU cores for various tensor contraction flavors and problem sizes N. The greenish colors show numpy.tensordot, the blueish colors opt_einsum, while reddish colors indicate cupy.tensordot timing results for different numbers of CPU cores. Specifically for Fig. 3(a), the problem size is given by N = o²v², where v and o are the numbers of virtual and occupied orbitals, respectively. Their sizes have been set according to the relations v = int(3K/4) and o = K − v, where K is the total number of basis functions. We should mention that the data for K = 500 using opt_einsum could not be obtained due to technical problems, as the memory consumption/memory peak exceeds the available physical memory of the computing node. Overall, the opt_einsum calculations are a factor of 3 faster than the corresponding numpy.tensordot variants but limited by their memory peak. Note, however, that the for-loop-based numpy.tensordot variant is roughly a factor of 2 slower compared to the contraction scheme where the v⁴ intermediate is formed. Thus, opt_einsum and numpy.tensordot are comparable in performance, with the former being modestly faster.
Fig. 3(a) further highlights that the benefit of performing a computation on a larger number of cores shrinks very rapidly. The timings reach a plateau at about 8 to 10 cores. Judging by the numbers, the contraction functions do not really benefit from the usage of more than 10 cores, with 4 cores being the most reasonable amount, assuming the computing time is limited. The bigger the basis set or problem size, the more benefit one gets from utilizing a higher number of cores. The cupy.tensordot timings are also shown in Fig. 3(a) for a direct comparison. We should note that they are independent of the number of CPU cores (as the mathematical operations are performed on the GPU) and faster than both the NumPy and opt_einsum alternatives.
Fig. 3(b) compares the required computing time of CPU and GPU-accelerated contractions. The results corresponding to GPU and CPU timings are taken as an average over 10 runs. Furthermore, the CPU data was obtained from computations exploiting 1 and 36 CPU cores, respectively. Fig. 3(b) contains 3 different datasets, namely, one for each investigated contraction "abcd,ecfd->eafb", "abcd,edfc->eafb", and "abcd,ecfd->efab". Note that the notation is internally translated to the Cholesky vectors by replacing the "abcd" part of the string by "xac,xbd". Thus, we always have three input operands in each tensor contraction operation. We should stress that the timings corresponding to all three contraction subscripts are very close to each other, making them almost indistinguishable on the plot. For reasons of comparability, the x axis shows the problem size, which is given by the size of the resulting (output) tensor. Table 1 summarizes the different problem sizes with respect to the contraction subscript, namely, N = o²v² for "abcd,ecfd->eafb" and "abcd,edfc->eafb" and N = o³v for the contraction "abcd,ecfd->efab", respectively. The standard deviation for the timings of larger problem sizes is about 1% of the average value, while it increases to approximately 10% for smaller problem sizes. The elements of the arrays used to perform the benchmark calculations were randomly generated. We attribute the larger standard deviation for smaller problem sizes to the higher impact a randomly generated array (e.g., very small numbers) can have when among fewer elements in an operation.
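The internal "abcd" to "xac,xbd" translation can be verified on toy data (hypothetical dimensions, not the benchmark setup): whenever the dense ERI is assembled from the Cholesky vectors, the dense and factorized subscripts yield identical results.

```python
import numpy as np

o, v, x = 2, 3, 4
rng = np.random.default_rng(7)
L = rng.standard_normal((x, v, v))
t_old = rng.standard_normal((o, v, o, v))

# dense ERI built from the Cholesky vectors: eri[a,b,c,d] = sum_x L[x,a,c] L[x,b,d]
eri = np.einsum("xac,xbd->abcd", L, L)

dense = np.einsum("abcd,ecfd->eafb", eri, t_old)
chol = np.einsum("xac,xbd,ecfd->eafb", L, L, t_old, optimize="optimal")
assert np.allclose(dense, chol)
```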
All in all, we observe an order of magnitude reduction of computing time for the CuPy implementation computed on the NVIDIA Tesla V100S PCIe 32GB (rev 1a) compared to our NumPy implementation computed on 36 CPU cores for 3 different tensor contractions encountered in the CCSD working equations.

A case study -a sensitizer molecule
To check the overall speed-up in actual chemical problems, we perform example calculations with and without GPU acceleration. Specifically, we scrutinize the impact on the overall computing time when offloading just a given set of contractions to the GPU. 2-Cyano-3-(4-N,N-diphenylaminophenyl)-trans-acrylic acid, commonly referred to as the L0 dye, 61 has been chosen as the candidate for this performance test. This organic dye was designed to be a potential sensitizer in dye-sensitized solar cells, making it a viable alternative to ruthenium-type dyes. The size of the L0 dye allows us to benchmark the Python-based hybrid CPU-GPU implementation for a chosen parameter set. We tested our CCSD and pCCD-LCCSD implementations by exploiting a cc-pVDZ basis set in our benchmark calculations. All calculations were performed with 36 parallel threads. Table 2 compares different CPU timings in seconds for cupy.tensordot to the timings of our numpy.tensordot variant. As highlighted in Table 2, one iteration step of the vector function takes around 2600 seconds on the CPU, while the corresponding function requires only 770 seconds to be evaluated in the case of the hybrid CPU-GPU variant. Note that the average iteration step time shown in Table 2 includes the evaluation of the vector function, the update of the CC amplitudes, and the evaluation of the CC energy expression, which is similar for the CPU and hybrid CPU-GPU implementations as those operations are not offloaded to the GPU. We observe similar speed-ups for the pCCD-LCCSD method. We should note that our generic GPU implementation also offloads the contraction "abcd,cedf->aebf" to the GPU in addition to the bottleneck operation "abcd,ecfd->eafb". The third bottleneck operation "abcd,ecfd->efab" corresponds to disconnected T2 terms and does not show up in the pCCD-LCCSD vector function. This additional tensor contraction corresponds to a term of O(o⁴v²). As expected, the evaluation of the pCCD-LCCSD vector function takes less time compared to the CCSD implementation as we exclude the majority of the disconnected terms. Offloading to the GPU reduces the computing time for the selected bottleneck operations from about 2000 s to 133 s, which corresponds to a speed-up factor of approximately 15. The overall speed-up for one evaluation of the pCCD-LCCSD vector function drops to a factor of 2541.80/662.38 ≈ 3.8. On average, the bottleneck contractions take about 20% of the computing time of the vector function per iteration step for the CPU-GPU hybrid implementation, while the corresponding time increases to 80% for the CPU variant, which is similar to our CCSD example.
We conclude that offloading the slowest contraction of the form of eq. (6) to the GPU already leads to a significant acceleration, shifting the bottleneck to another set of terms of the CC working equations. Additional speed-up can be obtained by offloading other speed-determining tensor contractions to the GPU.

Conclusions and outlook
Current trends in scientific programming heavily rely on Python-based implementations, 62 which are easy to code but not easily scalable to high-performance computing architectures. At the same time, the potential end-users of these codes would like to work on supercomputers to solve problems as large as possible without a profound knowledge of high-performance optimizations. To meet those needs for quantum chemistry problems, we analyzed the limitations of quantum chemistry methods written in Python, where some trade-off between memory and CPU time has to be made. We showed how to use existing Python libraries to speed up quantum chemical calculations and provided numerical evidence, including comparisons between CPUs and GPUs.
A common and reasonable practice to speed up an implementation is to identify the so-called bottleneck operations and focus on optimizing these routines.
In this work, we focused on the bottlenecks of CCSD-type methods, which are a given set of tensor contractions, and their translation into Python code using various third-party libraries. Specifically, we scrutinized computing timings and memory consumption, comparing numpy.tensordot and opt_einsum calculations. We found that opt_einsum computes about a factor of 2.5 faster than a modified version of numpy.tensordot on a CPU but is limited due to tremendous memory peaks and is, therefore, not a universal candidate when large system sizes are considered. Furthermore, we have rewritten selected numpy.tensordot routines imposing one for-loop to prevent the construction of large intermediate arrays. In general, this for-loop implementation is responsible for the drop in efficiency (speed) of numpy.tensordot compared to opt_einsum by approximately a factor of 2. Nonetheless, the gain in memory reduction outweighs the decrease in speed. Although third-party libraries are convenient to interface and easy to use, special attention has to be paid to possible intermediates created during the evaluation of, for instance, tensor contractions. Both opt_einsum and various NumPy linear algebra routines may create several intermediate arrays to increase speed. These memory outbursts may impede a black-box implementation that is computationally feasible for large-scale problems.
A promising alternative for Pythonic large-scale computing is the GPU. We implemented this set of bottleneck tensor contractions to be calculated on the GPU using CuPy, a Python library interfacing the NVIDIA CUDA application programming interface (API). Due to the size of the arrays, it is necessary to perform the computations on the GPU batch-wise by splitting the overall tensor contractions into smaller-sized problems that fit into the video memory (VRAM). Utilizing this implementation and performing benchmark calculations of the bottleneck contractions showed that a single GPU already leads to a factor of 10 speed-up compared to our NumPy-based methods using 36 CPU cores. This factor-of-10 speed-up of the contraction routines only translates to an overall speed-up of a factor of 3 as the bottleneck of the CC vector function evaluation shifts to a collection of terms of similar scaling, which are still evaluated on the CPU. Most importantly, our timing benchmark results are encouraging and again prove the potential of GPU utilization in Python-based computational chemistry.
Subsequently, we will identify the "new" bottleneck operations in tensor contractions and export them to the GPU for additional potential speed-up. Better utilization of the tensor cores could also lead to improvement, as good speed-ups are reported in developer forums, where FP32 input/output data in DL frameworks and HPC can be accelerated, running ten times faster than V100 FP32 FMA operations. 63 Yet, additional speed-up for quantum chemistry problems is expected on the NVIDIA A100 hardware, 64 which has more GPU cores and up to 80 GB of memory. Generally, a direct back-end CUDA implementation in C++ also offers plenty of room for improvements, like optimization of the array structure or concurrent computing. Alternative routes to export Python code to the GPU are, for instance, offered by the Intel oneAPI toolkits. Finally, multi-GPU utilization 65 could further improve the already tremendous factor-10 speed-up.

Figure 1: Schematic picture of the RAM, VRAM, GPU, and CPU, and the data transfer rates between the components.
    # The number of parts in which the Cholesky array will be split
    for b in range(0, parts_c):
        end_b += chol_chunk_lengths_1[b]
        # Batches of Cholesky arrays are copied to GPU memory (VRAM)
        chol_0 = cupy.array(numpy.array_split(L_0, parts_c, axis=1)[a])
        chol_1 = cupy.array(numpy.array_split(L_1, parts_c, axis=1)[b])
        # Calculating batch of xac,xbd->acbd on GPU
        result_temp = cupy.tensordot(chol_0, chol_1, axes=(0, 0))
        # Deallocation
        del chol_0, chol_1
        cupy.get_default_memory_pool().free_all_blocks()
        # Batches of the vector function array are copied to VRAM
        operand = cupy.array(numpy.array_split(t_old, parts_d, axis=0)[e])
        # Calculating batch of acbd,ecfd->abef on GPU
        result_temp_2 = cupy.tensordot(result_temp, operand, axes=([1, 3], [1, 3]))
        # Deallocation
        del operand, result_temp
        cupy.get_default_memory_pool().free_all_blocks()
        # Calculating batch of the transposition abef->eafb on GPU
        result_part = cupy.transpose(result_temp_2, axes=(2, 0, 3, 1))
        # Deallocation
        del result_temp_2
        cupy.get_default_memory_pool().free_all_blocks()
        # Result arrays are copied back to the main memory (RAM)
        t_new[start_e:end_e, start_a:end_a, :, start_b:end_b] = result_part.get()
        # Deallocation
        del result_part
        cupy.get_default_memory_pool().free_all_blocks()
        start_b = end_b
    start_a = end_a
start_e = end_e

Figure 2: Timings (a) and memory usage (b) of numpy.tensordot and opt_einsum with respect to the size (number of elements) of the largest input array. In both plots, the greenish colors show tensordot and the blueish colors show opt_einsum computations using 1 CPU core.

Figure 3: Computing times for CPU- and GPU-accelerated implementations. (a) Timings for the bottleneck contraction abcd,ecfd->eafb. The greenish colors show numpy.tensordot, the blueish colors correspond to opt_einsum, while reddish colors indicate cupy.tensordot results. K is the total number of basis functions. (b) Comparison of GPU and CPU computing times for different basis set sizes K. The numpy.tensordot results for calculations with 1 CPU thread are shown in blueish colors and for 36 CPU threads in greenish colors. The reddish colors show the cupy.tensordot results. The error bars in orange are determined for an average of 10 runs and show the standard deviation. For the number of elements of the final array, see the description in the main text.

Figure 4: Molecular structure of the L0 dye relaxed at the B3LYP/cc-pVTZ level of theory.
Since numpy.tensordot only supports a summation over two multi-dimensional arrays at a time, we need to divide a tensor contraction containing more than two operands into appropriate intermediates. Such a partitioning can be fully automatized, exploiting the numpy.einsum_path function. It proposes a contraction order of lowest possible cost for an einsum expression, taking into account the creation of intermediate arrays. The resulting tensordot_helper function has the following logic

    if the input arguments are invalid:
        raise ArgumentError(error message)
    # Get the optimal path
    path, _ = numpy.einsum_path(subscripts, *operands)
    # Sometimes einsum_path does not return a list of two-element tuples
    if no optimized path has been found:
        path = provide some path
    # Loop over all steps in the path
    for step in path[1:]:
        # Take the 1st and 2nd operand and contract them
        ...
    if outscript does not have the proper shape:
        trans = tuple describing how to transpose the output array
        outmat = outmat.transpose(trans)
    # Return the result of the tensor contraction
    return outmat

Internally, this string is further decoded according to the notation used in the employed LinalgFactory instance, that is, a dense or Cholesky-decomposed representation. If a dense representation of tensors is chosen, the string remains as is. For Cholesky-decomposed ERI, the input argument associated with the Cholesky instance (here "abcd") is translated to the internal Cholesky representation, that is, "xac,xbd".
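The role of numpy.einsum_path in this partitioning can be demonstrated on a toy three-operand contraction; the shapes and the two-index operand t2 below are illustrative assumptions, not the actual working equations.

```python
import numpy as np

rng = np.random.default_rng(0)
L = rng.random((12, 6, 6))  # Cholesky-like factor, indices xac / xbd
t2 = rng.random((6, 6))     # toy two-index operand (illustrative only)

# einsum_path proposes a contraction order of lowest estimated cost for the
# einsum expression, taking the intermediate arrays into account.
path, report = np.einsum_path("xac,xbd,cd->ab", L, L, t2, optimize="optimal")

# path[0] is the label 'einsum_path'; the remaining entries name which
# operands to contract at each step, so a tensordot-based helper can replay
# the same order with pairwise numpy.tensordot calls.
print(path)
print(report)  # human-readable summary of the proposed order

# The proposed path can be fed straight back into einsum.
result = np.einsum("xac,xbd,cd->ab", L, L, t2, optimize=path)
```

Each pairwise step of the returned path maps onto one numpy.tensordot call plus a possible final transpose, which is exactly the logic sketched for tensordot_helper above.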

Table 1: Problem sizes N determined for different basis set sizes K for the selected contractions benchmarked in this work. The last column indicates the size of the output array.

Table 2: Timings [s] and speed-up factors for the CPU and hybrid CPU-GPU implementation for selected CC calculations using a cc-pVDZ basis set (444 basis functions). Iteration step time indicates the time [s] required for one CC iteration step. It contains the evaluation of the vector function, the update of the CC amplitudes, and the evaluation of the CC energy expression. All timings correspond to differences in epoch times. Average iteration step: mean value for the time of one CC iteration averaged over 4 steps. Vector function step: time of the CPU/GPU part of the vector function averaged over 4 steps. Bottleneck contractions: time for all bottleneck contractions investigated in this work, that is, those exported to the GPU, averaged over 4 steps.