Low-Scaling GW Algorithm Applied to Twisted Transition-Metal Dichalcogenide Heterobilayers

The GW method is widely used for calculating the electronic band structure of materials. The high computational cost of GW algorithms prohibits their application to many systems of interest. We present a periodic, low-scaling, and highly efficient GW algorithm that benefits from the locality of the Gaussian basis and the polarizability. The algorithm enables G0W0 calculations on a MoSe2/WS2 bilayer with 984 atoms per unit cell, in 42 h using 1536 cores. This is 4 orders of magnitude faster than a plane-wave G0W0 algorithm, allowing for unprecedented computational studies of electronic excitations at the nanoscale.

Electronic excitations in matter play a pivotal role in various physical phenomena, including light absorption and transport.The characteristics of these excitations are strongly influenced by the host material.Excitons, which are bound electron-hole pairs, exhibit a remarkable and unusually strong electron-hole binding in low-dimensional semiconductors that have emerged in the last decade [1].When stacking two atomically thin semiconductors on top of each other, the atomic alignment between the layers can exhibit periodic variations, leading to a new type of in-plane superlattice known as the moiré superlattice.Excitons in moiré structures have gained enormous attention recently [2][3][4][5][6][7][8][9][10][11][12] thanks to their highly unusual exciton properties which include spatial confinement due to the moiré potential [2], interlayer [5,6], and intralayer charge transfer [9].Furthermore, electronic properties of moiré lattices can be tuned by the band alignment and the twist angle between the layers such that moiré structures hold great promise as an exciting platform for probing electronic and photonic quantum phenomena over the next decade [12].
Gaining insights into excitons in moiré structures can be achieved through a combination of experiments, theoretical models, and computations.As an example, low-angle MoSe 2 /WS 2 moiré structures have shown an interesting interplay of intra-and interlayer exciton hybridization because of the nearly degenerate conduction bands of the MoSe 2 and WS 2 layers.The conduction band offset and the wavefunction hybridization between layers, however, is still under debate [3,7,[13][14][15].Detailed knowledge about the electronic band structure of the MoSe 2 /WS 2 moiré bilayer and the implication on exciton formation and binding is thus crucial to resolve this controversy.
In this work, we focus on the GW method from many-bodyperturbation theory [16][17][18] which is an approximation for the electronic self-energy that allows for computing the electronic band structure of a given material.Importantly, GW accounts for the nonlocal, frequency-dependent screening of the interaction between electrons which is crucial in moiré bilayers.
The GW band structure is then the basis for the description of excitons via the Bethe-Salpeter equation [17,19].Currently available plane-wave-based GW algorithms are however incapable of treating low-angle moiré cells that contain thousands of atoms [20], despite their computational scalability to the largest supercomputers [21][22][23][24][25]. Stochastic GW methods may enable large-scale GW calculations [26,27], but it is not clear whether the numerical precision of this approach is sufficient for its application across the whole chemical space [28].For computing the GW band structure in large moiré cells, pristine unit-cell matrix projection (PUMP) has been suggested [9,20].PUMP is based on expanding the moiré cell wavefunctions in terms of the pristine unit-cell wavefunctions.By construction, PUMP cannot capture nanometer-scale atomic reconstruction of moiré structures which can dramatically influence their electronic band structure [4].
The GW space-time method [29] offers a promising route towards large-scale GW calculations.This is because the computational scaling is reduced from O(N 4 at N 2 k ) for standard GW algorithms to O(N 3 at N k ) in the GW space-time method, where N at is the number of atoms in the unit cell and N k the number k-points used to discretize the Brillouin zone.For achieving the scaling reduction, it is required to use a spatially local basis instead of plane waves.The local basis can be chosen as realspace grid where studies of unit cells up to one hundred atoms have been reported [29,30].Another choice of the spatially local basis is an atomic-orbital-like basis [31].This choice is highly efficient in the GW space-time method enabling GW calculations on molecules with more than 1000 atoms [32][33][34][35][36].
Periodic boundary conditions in the GW space-time method with atomic-orbital-like basis functions have not been reported yet.The main inhibiting factor has been the inclusion of k-dependent Coulomb interactions which represent a major challenge regarding computational efficiency and numerical precision [37][38][39].In this work, we overcome this challenge by employing real space representations of the polarizability, the screened Coulomb interaction and the self-energy.The real-space representation allows us to use the minimum image convention (MIC) [40,41], i.e., each atomic orbital in the simulation interacts only with the closest image of another atomic orbital.We benchmark the algorithm on G 0 W 0 bandgaps of monolayer MoS 2 , MoSe 2 , WS 2 , and WSe 2 finding an average deviation of only 0.06 eV from reference calculations [42,43].We also apply the GW algorithm to a MoSe 2 /WS 2 bilayer with an unprecedented cell size of 984 atoms which has an order of magnitude more atoms than previous state-of-the-art large-scale GW calculations [25].
We start with details on the main algorithmic advances for achieving large-scale GW calculations on two-dimensional semiconductors.The full GW algorithm is given in the Supporting Information.
Following our previous work [35,44], we compute the irreducible polarizability χ PQ (k=0, iτ) in imaginary time iτ at the Γ-point in an auxiliary atomic-orbital-like Gaussian basis set with indices P, Q.The polarizability χ PQ (k, iτ) is however needed on a dense k-point mesh because it is later multiplied with the bare Coulomb interaction that diverges at the Γ-point and thus requires a fine k-point sampling.The atom-centered basis allows us to decompose the Γ-point result, χ PQ (k=0, iτ), using the identity where χ R PQ is the real-space representation of the polarizability and ϕ R P denotes a Gaussian which is localized in cell R.For non-metallic systems, the polarizability χ(r, r , iτ) is space-local, i.e. χ(r, r , iτ) exponentially decays with increasing |r − r |. [45,46] The matrix element χ R PQ thus vanishes in case of a large distance between the center of ϕ 0 P and the center of ϕ R Q .We employ MIC, i.e., we assume that χ R PQ (iτ) in Eq. ( 1) is non-zero only if the atomic center of ϕ 0 P and the atomic center of ϕ R Q are closest together among all cells R. In this way, we extract χ R PQ (iτ) from Eq. ( 1), which is exact in the limit of a large, non-metallic unit cell.Using Eq. ( 2), we obtain the polarizability at any k-point at negligible computational cost, Following the GW space-time method [29], we compute the screened interaction in real space (full algorithm in SI), leading to the self-energy Σ(r, r , iτ) = iG(r, r , iτ)W(r, r , iτ) [29].Σ(r, r , iτ) is space-local as G(r, r , iτ) is space-local [45] and only elements of W(r, r , iτ) with small |r − r | contribute to Σ.We thus continue with the minimum image of Eq. ( 4) where the cell vector gives the smallest distance between the atomic centers R P of ϕ 0 P and the atomic center We use W MIC (iτ) to calculate the self-energy Σ µν (k=0, iτ) in the atomic-orbital basis {φ µ (r)} at the Γ-point.We thus avoid k-point sampling in this computationally expensive step.k-points in Σ follow from MIC at negligible computational cost, cf.Eqs. ( 2), (3), We transform the self-energy to real energy [18] and the Bloch basis which allows us to compute quasiparticle energies where v xc nk is the diagonal of the exchange-correlation matrix.The numerical trick in the presented GW algorithm is the MIC used in Eqs.(2), (5), and (7).MIC is exact in the limit of a large unit cell.We determine the critical cell size for the validity of MIC by computing the G 0 W 0 bandgap of monolayer MoS 2 , MoSe 2 , WS 2 , and WSe 2 , presented in Fig. 1.For the four materials, the bandgap changes on average by only 11 meV between the 10 ×10 supercell (300 atoms in the unit cell) and  8) as function of the supercell size (TZVP-MOLOPT basis set [47], without spin-orbit coupling (SOC)).
Table I.G 0 W 0 @PBE bandgap (in eV, without SOC) of monolayer WS 2 , MoS 2 , WSe 2 and MoSe 2 computed from Eq. ( 8) (TZV2P-MOLOPT basis [47], 10 × 10 supercell, detailed convergence test in the SI) and computed from plane-wave codes [39,42,43].We compare the G 0 W 0 bandgap of monolayer MoS 2 , MoSe 2 , WS 2 , and WSe 2 to the G 0 W 0 bandgap computed from three different plane-wave codes [39,42,43], see Table I.We find that our G 0 W 0 bandgaps deviate on average by only 0.06 eV to the bandgaps from plane-wave based codes.This small discrepancy might be due to the use of different pseudopotentials and the difficulty to reach the complete-basis-set limit.

Software package
The presented algorithm has several computational advantages over plane-wave-based GW algorithms.The computational bottleneck in plane-wave-based GW algorithms is the calculation of the irreducible polarizability [21,22,50], where G, G are reciprocal lattice vectors characterizing the plane wave e iG•r , q is a vector in the first Brillouin zone, n, n refer to occupied and empty bands, respectively, and the brackets in the second line denote integrals of a plane wave and Bloch states.The matrix in Eq. ( 9) is evaluated up to |G 2 | < |E cut | for both G and G where E cut is the dielectric energy cutoff.We calculate the number of floating point operations necessary to perform the multiplications in Eq. ( 9), see gray traces in Fig. 2. The estimate corresponds to the computational effort of a plane-wave based G 0 W 0 algorithm for 2D semiconductors and is based on realistic numerical parameters used in large-scale G 0 W 0 calculations [25,39].We also report the required number of operations of our presented G 0 W 0 algorithm in Fig. 2 (black traces).Our G 0 W 0 algorithm requires a similar number of floating point operations for a 9 × 9 supercell as a plane-wave G 0 W 0 algorithm for a 2 × 2 supercell.For a 1×1 2×2 4×4 6×6 8×8 10×10 14×14 needed for executing G 0 W 0 algorithms.Black: low-scaling G 0 W 0 algorithm from this work using a TZVP-MOLOPT basis set [47], gray: plane-wave-based G 0 W 0 algorithm, Eq. ( 9).Underlying computational parameters are typical for monolayer MoS 2 , MoSe 2 , WS 2 and WSe 2 , see detailed raw data available in the SI.
14 × 14 supercell, our G 0 W 0 algorithm requires 10 5 times less operations compared to a plane-wave based algorithm.This large factor has several origins, most important are the following: The plane-wave basis {e iG•r } resolves large vacuum regions [25,39] for two-dimensional materials and is thus a factor 10 larger than the Gaussian auxiliary basis {ϕ P }.We thus need to calculate 100 times less matrix elements of χ in a Gaussian basis compared to Eq. ( 9).Integrals over Gaussians, similar to the second line of Eq. ( 9), are sparse due to the spatial locality of Gaussians [35].Only 3 % of the integrals need to be considered for a 14 × 14 supercell reducing the number of operations by another factor 30.In the present algorithm, χ is evaluated at the Γ-point using real-valued matrix algebra [35] which makes another factor 4 compared to the complex matrix algebra in Eq. ( 9).In Eq. ( 9), at least a 3 × 3 mesh for q is necessary [25] which is responsible for another factor of 5 [51].These numerical parameters thus explain a factor of 60.000 between the required operations of a plane-wave G 0 W 0 algorithm and the G 0 W 0 algorithm from this work.Further advantages compared to plane-wave based algorithms include the cheap diagonalization of the Kohn-Sham matrix to obtain Bloch states thanks to the small Gaussian basis.Also, non-periodic directions are easily dealt with in our GW algorithm by restricting the sum over cells R to periodic directions.It is not necessary to truncate the Coulomb operator [39,52] in non-periodic directions as in plane-wave algorithms.Moreover, the self-energy (7) is available in the Gaussian basis set which allows to compute the G 0 W 0 correction for all Bloch states at low computational cost.
We measure the computation time of the algorithm, shown in Fig. 3.The computation time is moderate; as an example, a G 0 W 0 calculation on the 10 × 10 MoSe 2 supercell (300 atoms) takes only 7 hours on 576 cores.Assuming ideal scalability starting from the 9 × 9 cell, we estimate that a G 0 W 0 calculation on 4500 atoms is in reach [53].Scalability improvements are subject of ongoing work to achieve this system size in practice.
We now focus on an application of the G 0 W 0 algorithm to transition-metal dichalcogenide heterobilayers which recently gained increased attention due to twist-angle dependent moiré potentials and interlayer excitons [3,4,[7][8][9][12][13][14][15].Recent large-scale plane-wave-based GW calculations on twisted heterostructures were limited to 75 atoms in the unit cell [25].This GW computation [25] has been described to be highly cumbersome and it was only achieved owing to an advanced accelerated large-scale version of the BerkeleyGW code which scales to entire leadership high-performance computers with more than half a million cores [22,23].Small unit cells with 75 atoms only allow for the study of heterobilayers with selected, large twist angles and absent atomic reconstruction.
In order to illustrate the large-scale capabilities of our G 0 W 0 algorithm beyond monolayers, we focus on the prototypical MoSe 2 /WS 2 twisted heterostructures.On one hand, the different lattice parameters of MoSe 2 and WS 2 gives rise to a considerably large moiré periodicity at zero twist angle (∼ 8 nm), thus requiring a large number of atoms in the structure.On the other hand, low-angle MoSe 2 /WS 2 have shown an interesting interplay of intra-and inter-layer exciton hybridization because of the nearly degenerate conduction bands.This feature, however, is still under debate in the literature [3,7,[13][14][15].The underlying electronic structure is thus crucial to resolve this controversy and is exactly the kind of problem that require large-scale GW calculations.Here we considered MoSe 2 /WS 2 moiré superstructures with twist angles between 9.3 • and 26.6 • (Fig. 4) that have corresponding unit cells of up to 984 atoms.We emphasize that in all cases, the strain of the individual monolayers is < 0.01% compared to the experimentally determined lattice constants [54,55], which is important because the bandgap is very sensitive to strain [56,57].The G 0 W 0 bandgap of the MoSe 2 /WS 2 bilayer depends on the twist angle changing from 1.86 eV (9.3 • ) to 1.92 eV (26.8 • ), in line with experimental observations of the exciton emission energy [13].Our GW calculation on the 984-atom heterostructure takes 42 hours on only 1536 cores which is a factor 30.000 faster than with a plane-wave algorithm, see estimate in the SI.Such large-scale GW calculations are an ideal starting point for further analyzing the electronic structure of these materials.For example, with our GW algorithm, the calculation of deep moiré potentials [4] are within reach, which are caused by atomic reconstruction and height variations.Both crucially influence the interlayer screening that is captured by the GW method.On top of a GW calculation, the Bethe-Salpeter equation [17,19] will enable the study of excitons in large-scale moiré structures.Our computationally efficient scheme also holds great promise for nanoscale excited-state dynamics in low-dimensional materials.Current state-of-the-art studies only report the dynamics in clean monolayers [58][59][60] and models [61,62].
Summarizing, we have presented a low-scaling GW algorithm with periodic boundary conditions employing localized basis functions and the minimum image convention.The GW algorithm is numerically precise and requires up to five orders of magnitude less floating point operations compared to plane-wave codes.We carried out a G 0 W 0 calculation on a MoSe 2 /WS 2 heterostructure with 984 atoms in the unit cell which is an order of magnitude more than the state of the art [25].We are fully convinced that our GW algorithm will enable routine applications of GW and its time-dependent variants to low-dimensional, nanostructured materials that were previously computationally highly challenging.

CODE AND DATA AVAILABILITY
The low-scaling GW algorithm is implemented in the opensource CP2K package [48] which is freely available from github [49].Inputs and outputs of the calculations are also available on github [63].

Figure 2 .
Figure 2. Number of floating point operations (real double precision)needed for executing G 0 W 0 algorithms.Black: low-scaling G 0 W 0 algorithm from this work using a TZVP-MOLOPT basis set[47], gray: plane-wave-based G 0 W 0 algorithm, Eq. (9).Underlying computational parameters are typical for monolayer MoS 2 , MoSe 2 , WS 2 and WSe 2 , see detailed raw data available in the SI.

atFigure 3 .
Figure 3. Execution time of a G 0 W 0 calculation for MoSe 2 9 × 9 -14 × 14 supercells (TZVP-MOLOPT basis set) on Supermuc-NG (Intel Skylake Xeon Platinum 8174).Magenta points show the computational cost to diagonalize the polarizability χ(k) which allows us to remove all spurious negative eigenvalues of χ(k) to ensure numerical stability.Dashed lines show a fit αN β at to the execution time, where α and β are fit parameters.Raw data is available in the SI.

Figure 4 .
Figure 4. Band gap of a MoSe 2 /WS 2 heterostructure as function of the twist angle.Inset: Unit cell (black rhomboid) for 19.4 • twist angle contains 984 atoms.
MoS 2 WSe 2 MoSe 2 Figure 1.G 0 W 0 bandgap of monolayer WS 2 , MoS 2 , WSe 2 and MoSe 2 calculated from Eq. ( MoS 2 MoSe 2 WS 2 WSe 2 ×14 supercell (588 atoms in the unit cell).We conclude that the GW algorithm from this work can be used to study unit cells which are as large as a 10 ×10 supercell or larger.In the Supporting Information, we show additional convergence tests on the basis set size, the number of time and frequency points, the k-point mesh size, filter threshold for sparse operations, and the vertical box height.