Physics-inspired structural representations for molecules and materials

The first step in the construction of a regression model or a data-driven analysis, aiming to predict or elucidate the relationship between the atomic scale structure of matter and its properties, involves transforming the Cartesian coordinates of the atoms into a suitable representation. The development of atomic-scale representations has played, and continues to play, a central role in the success of machine-learning methods for chemistry and materials science. This review summarizes the current understanding of the nature and characteristics of the most commonly used structural and chemical descriptions of atomistic structures, highlighting the deep underlying connections between different frameworks, and the ideas that lead to computationally efficient and universally applicable models. It emphasizes the link between properties, structures, their physical chemistry and their mathematical description, provides examples of recent applications to a diverse set of chemical and materials science problems, and outlines the open questions and the most promising research directions in the field.

I. INTRODUCTION

The last decade has seen a tremendous increase in the use of data-driven approaches for the modeling of molecules and materials. Atomistic modeling has been a particularly fertile field of use; applications range from the analysis of large databases of materials properties 1 to the design of molecules with the desired behavior for a given application. 2 Machine learning techniques have been applied to devise coarse-grained descriptions of complex molecular systems, [3][4][5][6][7][8][9] to build accurate and comparatively inexpensive interatomic potentials, [10][11][12][13][14][15][16][17] and more generally to predict, or rationalize, the relationship between a specific atomic configuration and the properties that can be computed by electronic-structure calculations [18][19][20][21][22][23][24] .

All of these applications to atomic-scale systems share the need to map an atomic configuration A, identified by the positions and chemical identities of its N atoms {r_i, a_i}, and possibly by the basis vectors h of the periodic repeat unit, into a more suitable representation. This mapping associates A with a point in a feature space, which is then used to construct a machine-learning model to regress (fit) a structure-property relation, to cluster (group together) configurations that share similar structural patterns, or to further map the conformational landscape of a data set onto a low-dimensional visualization. The terms descriptor or fingerprint are often used in chemical and materials informatics to indicate heuristically-determined properties that are easier to compute but reflect more directly the relationship between a structure and the quantities one wants to predict. 25 Examples of descriptors include the fractional composition of a compound, the electronegativity of its atoms, or a low-level-of-theory determination of the HOMO-LUMO gap of a molecule. In this review we focus on a more systematic class of mappings that use exclusively the atomic composition and geometry as inputs, and aim to characterize precisely the instantaneous arrangement of the atoms; for these we use the term representation. We will be especially interested in those representations that apply geometric and algebraic manipulations to the Cartesian coordinates, transforming them in a way that fulfills physically-informed requirements: smoothness and symmetry with respect to isometries. Commonly used representations include atom-centered symmetry functions 10,26 and the smooth overlap of atomic positions (SOAP) 27 .
It is important to note that representations can be expressed using different mathematical entities: at the most basic level, the space of features can take the form of a vector space, in which each configuration is associated with a finite-dimensional vector whose entries are explicitly computed by the mapping procedure. Depending on the application, however, it may be simpler or more natural to express the relationship between pairs of configurations in terms of a distance metric D(A, A′), or a kernel function K(A, A′). As we will see, distance- or kernel-based formulations implicitly define a feature space, which in most cases can be expressed (at least approximately) in terms of a vector of features, and so can be seen as equivalent to a representation of individual structures.
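As a minimal sketch of this equivalence (toy feature vectors and a simple linear kernel; all values and names here are hypothetical, for illustration only), the distance induced by a kernel coincides with the Euclidean distance between the corresponding feature vectors:

```python
import math

# Toy feature vectors for two configurations A and A' (hypothetical values).
phi_A = [0.2, 1.0, 0.5]
phi_B = [0.1, 0.9, 0.7]

def kernel(x, y):
    """Linear kernel: the dot product of two feature vectors."""
    return sum(xi * yi for xi, yi in zip(x, y))

def distance(x, y):
    """Distance induced by the kernel:
    D(A, A')^2 = K(A, A) + K(A', A') - 2 K(A, A')."""
    return math.sqrt(kernel(x, x) + kernel(y, y) - 2.0 * kernel(x, y))

# The kernel-induced distance equals the Euclidean distance in feature space.
d_direct = math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(phi_A, phi_B)))
assert abs(distance(phi_A, phi_B) - d_direct) < 1e-12
```

The same identity holds for any kernel that is an inner product of (possibly implicit) feature vectors, which is what allows kernel- and distance-based methods to be viewed as operating on a representation.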
While one can trace the origins of different representations to specific subfields of computational chemistry and materials science, the fact that representations should describe precisely the nature and positions of each atom means that they often are not specialized to a given application, but can be used with little modification for any atomistic system, from gas-phase molecules to bulk solids [28][29][30] . This generality, however, does not mean that representations are completely abstract or disconnected from physical and chemical concepts. Over the past few years, it has become clear that representations that reflect more closely some fundamental principles (such as locality, the multi-scale nature of interactions, or the similarities in the behavior of elements from the same group of the periodic table) usually yield models that are more robust, transferable and data-efficient. The link between a representation and the physical concepts it incorporates is usually mediated by the strategy one uses to fit the desired structure-property relations: it is often possible to show an explicit relationship between linear regression models built on the representation of a structure and well-known empirical forms of interatomic potentials (such as body-ordered or multipole expansions), while more complex, non-linear machine-learning schemes built on the same features improve the flexibility in describing structure-property relations, albeit at the price of a less transparent interpretation of their behavior.
Given the central role of structural representations in the application of data-driven methods to atomistic modeling, it is perhaps not surprising that considerable effort is being dedicated to understanding and improving their properties. These efforts follow several directions. First, an efficient, scalable, and parallel implementation of the construction of a given set of features is essential to ensure computational efficiency. Second, reducing the number of features used to describe the system lowers the computational effort, and often improves the robustness of the model: feature selection aims at identifying the most expressive, yet concise, description of the system at hand. Third, it is often desirable to fine-tune a representation so that it facilitates training a model on a small number of reference structures, by incorporating more explicitly the available prior knowledge.
This review aims to summarize recent work on the construction of efficient and mathematically sound representations of atomic and molecular structures, with a particular focus on the use for the regression of atomic-scale properties. Rather than focusing on a historical overview, we intend to provide a snapshot of the current insights on what makes a good representation, supporting our considerations with recent publications, and providing a perspective of the most promising research directions in the field.

II. NOTATION

A            an atomic structure
A_i          an environment centered on the i-th atom of the structure A
r_i          position of the i-th atom
r_ji         vector separating the i-th atom and its j-th neighbor, r_j - r_i
Q            generic continuous index enumerating the components of an atomic representation
q            generic discrete index enumerating the components of an atomic representation
⟨Q|A⟩        a representation of a structure A indexed by an unspecified label or set of labels Q
φ(A_i)       feature vector (with elements indexed by q) associated with an atom-centered environment, φ_q(A_i) = ⟨q|A_i⟩
Φ            feature matrix combining the features associated with multiple structures/environments
ϕ_q          a column in a feature matrix, where (ϕ_q)_i = φ_q(A_i)
F(A_i)       an atom-centered property, or its systematic approximation in terms of an atom-centered representation
F̃(φ)         a non-linear model that approximates F(A_i) using the feature vector φ(A_i)
K(A, A′)     a (non-)linear kernel computed between two structures or environments, represented by the corresponding feature vectors φ(A)
D(A, A′)     a distance computed between two structures or environments
|ρ⟩          structure representation based on a smooth atom density
|ρ_i⟩        representation of an environment centered on atom i that results from symmetrizing |ρ⟩ over translations
|ρ_i^⊗ν⟩     symmetrized ν-point correlation of the atomic density built on the atom-centered representation |ρ_i⟩
|δ⟩          Dirac-δ limit of the smooth atom density |ρ⟩; analogous symmetrized versions are indicated as |δ_i⟩ and |δ_i^⊗ν⟩
|V⟩          atom-density field representation, suitable to describe long-range correlations

FIG. 1. The mapping between structures and feature space should obey fundamental physical symmetries (equivalent structures should be mapped to the same features); should be complete (inequivalent structures should be mapped to distinct features); should be smooth (continuous deformations of a structure should map to a smooth deformation of the associated features). Furthermore, whenever dealing with datasets that are not homogeneous in molecular size, the representation should be additive: a structure should be decomposed into a sum of local environments (usually atom-centered), ensuring transferability and extensivity of predictions.

III. REPRESENTATIONS FOR MATERIALS AND MOLECULES
Even though this review has no intention of providing a comprehensive discussion of the development of descriptors for atomic structures, it is worth giving a brief overview. A "data-driven" philosophy emerged early in the field of chemical and molecular science, where the combinatorial extent of the space of possible molecules, 31 and the possibility of accessing this space with comparatively simple synthetic strategies, encouraged the development of quantitative structure-property relationship (QSPR) techniques, which attempt to map 32 descriptors of molecular structure (based on cheminformatics fingerprints, 33,34 chemical-intuition driven descriptors, 35 molecular graphs, 36 or indicators obtained from quantum chemical calculations 37 ) to the behavior of a selected compound, usually focusing on properties of direct applicative interest [38][39][40] such as solubility, toxicity, 41 or pharmacological activity. 42

This approach should be contrasted with that of "bottom-up" predictions, which aim to use models of the interactions between the atomic constituents of a material to simulate the behavior of the system on atomic time and length scales. Starting from the early days of molecular simulations, [44][45][46][47] the objective was to predict the energy, the forces, or any other observable of interest for a specific molecular configuration, and to use them to search for (meta-)stable configurations, or to simulate the evolution of the system by molecular dynamics 48,49 . In the absence of reliable reference values for the properties of specific atomic configurations, interatomic potentials (also called empirical force fields) were built using physically-inspired functional forms, combining harmonic terms to describe chemical bonds with Coulomb and 1/r^6 terms to describe electrostatics and dispersion. Their (few) parameters were determined by matching the values of experimental observables, such as cohesive energies, lattice parameters and elastic constants.
The continuous increase in computational power, and the availability of electronic-structure techniques with a better cost-accuracy ratio, [50][51][52] have made it possible to compute extremely accurate energies and properties of specific configurations. This has opened the way to ab initio simulations of materials 47 , but has also provided a viable alternative to empirical functional forms for the construction of interatomic potentials. Starting from the simplest compounds 53 , and then gradually increasing in complexity 54 , molecular potential energy surfaces fitted by interpolating between a comparatively small number of ab initio reference calculations provided the first practical applications of this idea. The possibility of combining very accurate calculations of the electronic structure of atomic systems with sampling of the statistics and dynamics of the nuclei on the electronic potential energy surface has allowed theoretical predictions that do not only agree with experimental results: 53 they can predict experiments 55 two decades before measurements become precise enough to verify the theoretical values. 56

Even though the ultimate goal of QSPR models and machine-learned potentials is the same, namely predicting scientifically and/or technologically relevant properties of molecules and materials, the approaches they follow to achieve this goal are quite different, which is reflected in the way an atomic structure is translated into an input for a machine-learning model. Cheminformatics descriptors, or fingerprints, are built ad hoc, incorporating both descriptors of molecular structure and composition and easy-to-estimate molecular properties. They usually rely on a considerable amount of prior knowledge, are often system and problem specific, and are meant to label a compound rather than a specific configuration of its atoms.
This is a logical consequence of the fact that QSPR aims for an end-to-end description of a thermodynamic property, which is not an attribute of an individual configuration, but of a thermodynamic state of matter. In the case of bottom-up modeling, instead, one aims first at building a very accurate surrogate model that is capable of reproducing precisely and inexpensively the outcome of quantum calculations for a specific configuration of the atoms. The end goal of predicting thermodynamic properties is achieved by coupling these predictions with statistical sampling methods 48,49,57 aimed at computing averages over the appropriate classical (or quantum 58 ) distribution of atomic configurations. As a consequence, the representations used as inputs of these surrogate quantum models are usually rather generic, are constructed using exclusively atomic coordinates and chemical species, and aim to establish a precise mapping between a specific structure and the associated atomic-scale quantities. We focus our discussion on this latter class of features, and do not attempt to cover recent, and rather successful, attempts to use descriptors that rely on low levels of quantum mechanical theory to predict high-end, accurate molecular properties. 59

In the rest of this section, we discuss the properties that are desirable for a representation used in atomistic machine learning, which are graphically summarized in Figure 1. The mapping between structures and features should be consistent with basic symmetries, i.e. reflect the fact that the properties associated with a structure do not change when the reference system or the labelling of identical atoms are modified; it should be smooth, so that models built on the features inherit a regular behavior with changing atomic coordinates; and it should be complete, so that fundamentally distinct configurations are never mapped to the same set of features.
Furthermore, many machine-learning tasks benefit greatly from being based on local features, which describe atoms or groups of atoms. Even though this is a less stringent requirement and, as we discuss below, global descriptors have been used very successfully, representations based on local environments are usually associated with higher transferability, reflecting a "divide and conquer" approach to materials modeling 60,61 . Finally, less fundamental but no less important requirements are the numerical stability and computational efficiency of the structure-representation mapping, which we discuss in Section VIII.

A. Symmetry
The Cartesian coordinates of the atoms encode all the information that is needed to reconstruct the geometry of a structure. Yet, it is obvious that they cannot be used directly as the input of a regression model. The fact that the Cartesian description of a molecule depends on its absolute position and orientation in space, and on the order in which atoms are listed, means that completely equivalent configurations can be represented by many different sets of Cartesian values, which makes any regression, classification or clustering scheme inefficient and potentially misleading. Over the years, many different approaches have been proposed to enforce translation, rotation, inversion and atom-permutation symmetries, which is reflected in the variety of alternative frameworks for building an effective representation to be used as the input of an atomistic machine-learning scheme. In fact, symmetry is such a central principle underpinning these efforts that it can be used to construct a "phylogenetic tree" of representations, organized according to the strategy that is used to incorporate symmetry in their construction, as shown in Figure 2.
The need to remove the trivial symmetries, namely the dependency of the Cartesian coordinates on the origin and orientation of the reference system, was recognized very early in the field of chemical and materials modeling. Different sets of internal coordinates 62 (bonds, angles, torsions) have been proposed, based on chemical intuition, as invariant descriptors of molecular geometry, and most of the molecular force fields that have been so effective in the modeling of biological systems 63-66 rely on internal coordinates to define bonded interactions. A collection of internal coordinates that is sufficient to fully characterize the geometry of a structure, often referred to as the Z-matrix, is a paradigmatic example of this class of representations. Even though the efficiency of this approach has often been questioned 67,68 , particularly because there is no unique way to define the Z-matrix, internal coordinates are still ubiquitous, and are effective whenever the system being studied has a well-defined, persistent bonding pattern (see Ref. 69 for a recent review). In these cases, internal coordinates can be seen as the initial step in the construction of discretized molecular representations, such as a molecular graph. Even though very widely used in chemical machine learning 2,70 , these graph-based schemes are not meant to describe the exact arrangement of the atoms, but just their bonding pattern, and so fall outside the scope of this review.
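The invariance of internal coordinates can be verified in a few lines. In this sketch (a hypothetical water-like trimer and illustrative helper functions), every Cartesian coordinate changes under a rigid rotation and translation, while bond lengths and the bond angle do not:

```python
import math

def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def angle(a, b, c):
    """Angle (radians) at atom b, formed by the triplet a-b-c."""
    u = [ai - bi for ai, bi in zip(a, b)]
    v = [ci - bi for ci, bi in zip(c, b)]
    cosang = sum(ui * vi for ui, vi in zip(u, v)) / (dist(a, b) * dist(c, b))
    return math.acos(max(-1.0, min(1.0, cosang)))

def rotate_z(p, t):
    """Rotate a point by angle t about the z axis."""
    x, y, z = p
    return (x * math.cos(t) - y * math.sin(t),
            x * math.sin(t) + y * math.cos(t), z)

# A bent triatomic (water-like toy geometry, arbitrary units).
mol = [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0)]
# Rotate by 0.7 rad and translate: all Cartesian coordinates change...
moved = [tuple(c + s for c, s in zip(rotate_z(p, 0.7), (1.0, -2.0, 3.0)))
         for p in mol]
# ...but the internal coordinates (bonds and the angle at atom 0) do not.
for i, j in [(0, 1), (0, 2)]:
    assert abs(dist(mol[i], mol[j]) - dist(moved[i], moved[j])) < 1e-12
assert abs(angle(mol[1], mol[0], mol[2]) - angle(moved[1], moved[0], moved[2])) < 1e-12
```

Note that the bond and angle labels still depend on the atom ordering, which is exactly the permutation problem discussed next.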
The limitations of an internal-coordinates description become most apparent when one wants to describe a chemically-active system: the bonding patterns can change during the course of a simulation, and invariance to atom-index permutations therefore becomes crucial to achieve a consistent model. The Empirical Valence Bond (EVB) method 71 has been used to model bond-breaking events, but the generality of the EVB approach is limited because the possible bonding assignments need to be pre-determined. This led to the development of representations that are intrinsically independent of the ordering of the atoms, such as permutation-invariant polynomials (PIPs) 11,[72][73][74] , which are obtained by summing functions of the internal coordinates over all possible orderings. In their original implementation, the exponentially increasing cost of evaluating these sums limited their applicability to molecules with a small number of degrees of freedom.
The idea that atomic structures had to be properly symmetrized before being passed to a regression algorithm was understood early in the context of constructing molecular potential energy surfaces, 75 and applied to the first wave of "neural network potentials" 76 . It was made central to the construction of interatomic potentials for condensed-phase materials by Behler and Parrinello, who in Ref. 10 introduced the concept of atom-centered symmetry functions (ACSF). Similarly to PIPs, ACSF are translationally and rotationally invariant because they are functions of angles and distances, and permutationally invariant because they are summed over all possible atomic pairs and triplets within an atomic environment. The computational cost of ACSF is kept under control by restricting the range of interactions (which we discuss further in subsection III C) and the body order of the correlations considered. Despite these restrictions, ACSF models have been shown to achieve comparable accuracy to that reached by PIPs 77 . Indeed, the recently proposed atomic PIPs 78 use the same polynomial basis as global PIPs, but avoid the unfavorable scaling with increasing molecule size by combining locality (via a distance cutoff) and a truncation of the order of the expansion.
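A radial ACSF can be sketched as follows; the Gaussian form of G^2 and the cosine cutoff follow Behler and Parrinello's construction, while the toy geometry and hyperparameter values are arbitrary choices for illustration. Summing over neighbors makes the feature permutation invariant by construction:

```python
import math

def f_cut(r, r_c):
    """Cosine cutoff function: decays smoothly to zero at r_c."""
    return 0.5 * (math.cos(math.pi * r / r_c) + 1.0) if r < r_c else 0.0

def g2(center, neighbors, eta, r_s, r_c):
    """Radial symmetry function G^2: a sum of Gaussians of the
    neighbor distances, modulated by the cutoff function."""
    total = 0.0
    for pos in neighbors:
        r = math.sqrt(sum((p - c) ** 2 for p, c in zip(pos, center)))
        total += math.exp(-eta * (r - r_s) ** 2) * f_cut(r, r_c)
    return total

center = (0.0, 0.0, 0.0)
neigh = [(1.0, 0.0, 0.0), (0.0, 1.5, 0.0), (0.0, 0.0, 2.2)]
val = g2(center, neigh, eta=1.0, r_s=1.0, r_c=4.0)
# Reordering the neighbor list leaves the feature unchanged.
assert abs(val - g2(center, neigh[::-1], eta=1.0, r_s=1.0, r_c=4.0)) < 1e-12
```

In practice one evaluates many such functions, with different (eta, r_s) pairs and angular variants, to build a feature vector that resolves the environment at several length scales.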
Internal coordinates are also the fundamental building block of molecular matrix representations, which are based on functions of the interatomic distances within a structure. Coulomb matrices, which list the formal electrostatic interactions q_i q_j / r_ji between each atomic pair in a structure, were extensively explored in early applications of machine learning to molecular properties 18 , the main limitation being the lack of permutation invariance 79 , which has also been tackled by approximate symmetrization, summing over a manageable number of randomized orderings of the atoms 80,81 . We discuss alternative approaches to symmetrizing Coulomb matrices, as well as other representations based on molecular matrices, in Subsection III B.
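A minimal construction of a Coulomb matrix (using nuclear charges as q_i, and the conventional 0.5 Z_i^2.4 diagonal; the water-like toy geometry is hypothetical) makes the lack of permutation invariance explicit:

```python
import math

def coulomb_matrix(charges, positions):
    """Coulomb matrix: M_ii = 0.5 * Z_i^2.4, M_ij = Z_i Z_j / |r_i - r_j|."""
    n = len(charges)
    M = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                M[i][j] = 0.5 * charges[i] ** 2.4
            else:
                r = math.sqrt(sum((a - b) ** 2
                              for a, b in zip(positions[i], positions[j])))
                M[i][j] = charges[i] * charges[j] / r
    return M

# Water-like toy geometry (arbitrary units) with nuclear charges O=8, H=1.
Z = [8, 1, 1]
pos = [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0)]
M = coulomb_matrix(Z, pos)
# Swapping the two hydrogens permutes rows/columns: the matrix itself
# is NOT permutation invariant, only quantities such as its sorted
# eigenvalue spectrum are.
M_swapped = coulomb_matrix([8, 1, 1], [pos[0], pos[2], pos[1]])
assert M != M_swapped                         # different matrices...
assert abs(M[1][2] - M_swapped[2][1]) < 1e-12  # ...related by a permutation
```

This is why early applications resorted to sorting rows, taking eigenvalues, or averaging over randomized orderings, as discussed above and in Subsection III B.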
The phylogenetic tree in Fig. 2 shows that a large number of existing representations follow a different strategy to achieve symmetrization: rather than using internal coordinates that are inherently invariant to rotations and translations, they first (implicitly or explicitly) describe the system as an atom density Σ_i g(x − r_i), obtained by summing localized functions centered on the positions r_i of all atoms in the system. Such a density is naturally invariant to permutations, and only at a later stage does one proceed to symmetrize it over translations and rotations. We discuss this second approach in great detail in Section IV. It suffices to say, at this point, that even if the construction of symmetrized density representations is conceptually very different from those based on internal coordinates, there are many direct and indirect links between the two branches, sketched in Figure 2, which we will discuss when reviewing specific classes of representations.
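A sketch of the atom-density construction, with a Gaussian smearing g whose width σ is an arbitrary choice: because the density is a sum over atoms, it is invariant under any relabeling of the atom list, before any symmetrization over translations or rotations is applied:

```python
import math

def density(x, positions, sigma=0.5):
    """Smooth atom density rho(x) = sum_i g(x - r_i), with g a Gaussian
    of width sigma centered on each atomic position r_i."""
    return sum(math.exp(-sum((xi - ri) ** 2 for xi, ri in zip(x, r))
                        / (2.0 * sigma ** 2))
               for r in positions)

atoms = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
x = (0.3, 0.2, 0.1)
# The sum over atoms makes the density invariant to relabeling.
assert abs(density(x, atoms) - density(x, atoms[::-1])) < 1e-12
```

The width σ plays the role of the smearing discussed in the next subsection: it controls the smoothness of the resulting features.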

B. Smoothness
The overwhelming majority of atomic-scale properties are continuous, smooth functions of the atomic coordinates. Function regularity is crucial for creating efficient ML models, and is therefore one of the requirements for a good structural representation. Features constructed from a symmetrized atom density are naturally smooth functions of atomic coordinates, and it is usually not a problem to maintain this regular behavior upon symmetrization over translations and rotations. The level of smoothness can be adjusted by smearing the atomic density, or by expanding it on a smooth basis (effectively a Fourier smoothing), as we discuss more extensively in Section IV. Internal coordinates are also usually smooth, but the process of manipulating them to achieve a permutation invariant representation can affect the smoothness of the mapping.
One way to obtain permutation invariance without incurring the exponential cost of enumerating all possible permutations of atomic indices involves sorting the entries in a distance or Coulomb matrix 80,82 , an approach that has also been used with permutation invariant vectors (PIV) 83 and "bag of bonds" (BoB) features 84 . Computing the eigenvalues of (functions of) interatomic distances, which underlies the SPRINT method 85 as well as the overlap matrix eigenvalue fingerprints 86,87 , also effectively achieves permutation invariance by similar means, since the vector of eigenvalues is taken to be sorted in ascending or descending order. The earliest implementation of the DeepMD scheme 88 also relied on sorting a local distance matrix. However, the sorting operation introduces derivative discontinuities in the mapping between Cartesian coordinates and features, because the order of the distance vector changes as atoms are displaced in the structure. Figure 3 illustrates the discontinuity of the derivatives of a function that is built from an ordered list of features. Consider a system of 3 atoms that is uniquely defined by the 3 interatomic distances r_i, where the index i denotes the position of the interatomic distance r_i in the ordered list of distances. We define a smooth function of the sorted distances, f = Σ_i c_i (r_i − r_i^0)^2, parameterized by c and r^0. The function f is indeed invariant to permutations of the atom order in the trimer, but at the price of introducing kinks in f and discontinuities in its derivative whenever the distance ordering changes. Fitting any smooth function of the trimer geometry by optimizing the parameters c and r^0 would necessarily lead to poor approximation accuracy.
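The kink described above is easy to reproduce numerically. In this sketch (arbitrary coefficients c and references r^0), the one-sided derivatives of f with respect to one distance disagree at the point where the sort order changes:

```python
# Hypothetical parameters of the quadratic model f = sum_i c_i (r_i - r0_i)^2.
c = [1.0, 2.0, 3.0]
r0 = [0.5, 1.0, 1.5]

def f(distances):
    """Smooth in the *sorted* distances, but not in the coordinates."""
    r = sorted(distances)
    return sum(ci * (ri - r0i) ** 2 for ci, ri, r0i in zip(c, r, r0))

def deriv(d2, h=1e-6):
    """Central finite-difference derivative of f w.r.t. the second distance,
    with the other two distances fixed at 1.0 and 2.0."""
    return (f([1.0, d2 + h, 2.0]) - f([1.0, d2 - h, 2.0])) / (2.0 * h)

# As d2 crosses 1.0 its position in the sorted list changes, and the
# slope of f jumps: the one-sided derivatives disagree.
left, right = deriv(0.99), deriv(1.01)
assert abs(left - right) > 0.5
```

Below the crossing the slope is governed by the first (c_1) term, above it by the second (c_2) term, so the derivative is discontinuous even though f itself is continuous.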
The lack of regularity has implications for the accuracy and stability of machine-learning models built on such features, as has been shown recently by using a Wasserstein metric to compare Coulomb matrices in a permutation-invariant manner 89 . In this context it is worth noting the remarkable connection linking the Euclidean distance between vectors of sorted distances and the Wasserstein distance between radial distribution functions (Section III.F in Ref. 90), which builds a formal bridge between conceptually unrelated families of atomic-scale representations.

C. Locality and additivity
The overwhelming majority of empirical interatomic potentials are expressed as an additive combination of local terms, or of long-range pairwise contributions. Early models built to fit molecular potential energy surfaces were instead constructed explicitly as a function of the coordinates of all atoms in the system. 53,91,92 Besides the issue of computational cost, this approach is problematic because it hinders the application of the potential to a molecule with a different number of atoms, or a different chemical composition. The work of Behler and Parrinello 10 not only had the merit of emphasizing the importance of symmetries in atomistic machine learning, but also applied to ML interatomic potentials an additive expansion of the molecular energy E(A), writing it as a sum of atom-centered contributions, E(A) ≈ Σ_{i∈A} E(A_i). The notion of an additive decomposition of properties, which is also implicit in the functional forms of most interatomic potentials, has far-reaching consequences for data-driven models. Combined with the requirement that the atomic contributions only depend on the positions of atoms within a finite range of distances, which is needed for the method to be computationally practical and is supported by fundamental physical principles 93 , the additivity assumption breaks down the problem of predicting the properties of a complex structure into simpler, short-range problems. This has non-trivial consequences in terms of the data efficiency of the model, as discussed in Subsection VIII C. An additive decomposition is also the most straightforward way to ensure extensivity of predictions 94 , i.e. that the prediction of a property for two copies of a molecule at infinite distance from each other is equal to twice the prediction for a single molecule.
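The additive, finite-range decomposition can be sketched as follows. The Lennard-Jones-like pair term and the cutoff value are hypothetical stand-ins for a learned atomic energy function, but the extensivity check at the end holds for any local model of this form:

```python
import math

def pair_energy(r):
    """Hypothetical short-ranged pair term (a Lennard-Jones-like sketch,
    standing in for a learned function of the local environment)."""
    return 4.0 * ((1.0 / r) ** 12 - (1.0 / r) ** 6)

def atomic_energy(i, positions, r_cut=3.0):
    """Local contribution E(A_i): depends only on neighbors within r_cut."""
    e = 0.0
    for j, rj in enumerate(positions):
        if j == i:
            continue
        r = math.sqrt(sum((a - b) ** 2 for a, b in zip(positions[i], rj)))
        if r < r_cut:
            e += 0.5 * pair_energy(r)  # half of each pair term per atom
    return e

def total_energy(positions):
    """E(A) = sum_i E(A_i): additive over atom-centered contributions."""
    return sum(atomic_energy(i, positions) for i in range(len(positions)))

dimer = [(0.0, 0.0, 0.0), (1.1, 0.0, 0.0)]
# Two copies far beyond the cutoff: the prediction is exactly extensive.
far = dimer + [(100.0, 0.0, 0.0), (101.1, 0.0, 0.0)]
assert abs(total_energy(far) - 2.0 * total_energy(dimer)) < 1e-10
```

The same bookkeeping carries over unchanged when atomic_energy is replaced by a regression model evaluated on atom-centered features.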
It is no accident that in the field of molecular machine learning, too, for which many of the early representations aimed at a global description of a molecule 18,29,95,96 , most of the recent approaches have moved to additive, atom-centered representations 97,98 , which yield more accurate and transferable models for properties that are expected to be extensive 99 . Oftentimes it is possible, and relatively straightforward, to modify a global representation to describe an atom-centered environment 78,86,100 , or to combine atom-centered representations to build a global description 101 , e.g. by summing or averaging the values of all the atom-centered features that are present in the structure, as we discuss in Section VII B. The connection between locality, additivity, and the nature of the structure-property relation that one wants to model is essential to the construction of effective and transferable machine-learning models.

D. Completeness
The requirements of symmetry, smoothness and locality can be seen as geared towards reducing the complexity of the structural representation: eliminating redundant structures, reducing the resolution to the intrinsic length scale over which the target property exhibits substantial variations, and breaking down complicated compounds into simple fragments. This simplification should not, however, come at the expense of the completeness of the representation, meaning that the mapping between Cartesian and feature spaces should keep inequivalent structures distinct. For example, it has been known for some time that a histogram of interatomic distances (discarding the identity of the connected atoms) is insufficient to fully characterize a structure composed of more than three atoms 27,102,103 . More recently, counterexamples have emerged showing that atom-centered correlations, at least those of low order, are also insufficient to preserve the injectivity of the structure-feature mapping (see Ref. 104 and Section VI B for a more thorough discussion).
Besides completeness in terms of the geometric structure-feature mapping, one should also consider whether, for a chosen regression scheme, the feature-property mapping can be converged to arbitrary accuracy. More complex, non-linear models can often provide good results even when using a representation that involves excessive smoothing, or a highly truncated version of a family of features. The interplay between model and features is discussed in more detail in Section V, and the problem of completeness in Section VI.

IV. SYMMETRIZED ATOMIC FIELD REPRESENTATIONS
As discussed in the previous Section, a multitude of representations have been introduced over the past decade, attempting to incorporate basic principles of symmetry and locality at the very core of atomistic machine learning. The differences between them are much less fundamental than they appear at first glance, and in fact several works have recently pointed to the existence of a unified framework, in which an explicit formal connection can be established between the vast majority of representations. 90,[105][106][107] In this Section we summarize the construction of this class of features, which we refer to as "symmetrized atomic field representations", emphasizing the role played by symmetry and locality, and hinting at the connection between this class of features and a linear mapping between structure and properties, which is discussed in more detail in Section V.

A. Dirac notation for atomic representations
We formalize a notation, extending the one introduced in Refs. 90,105 and used in Ref. 108 to compare different kinds of local and global representations, to express the feature vectors associated with a representation in a way that mimics Dirac notation in quantum mechanics. The notation is convenient for performing manipulations of the features that can be expressed as linear transformations, emphasizes the fact that several widely adopted representations are of the same nature, differing primarily in the choice of basis, and allows for a more expanded layout of complicated feature indices compared to subscript notations. Each structure A is associated with a ket |A; rep⟩ that also contains a description of the nature of the representation. Indices that enumerate the corresponding features are associated with the bra ⟨Q|. Both the ket and the bra indices can (and will) be used with some looseness, to emphasize the most relevant elements of a representation while keeping the notation slim. For instance, as shown in Fig. 4, one can indicate explicitly multiple bra indices when their meaning in the definition of a representation is important, separating with a semicolon groups of indices that are conceptually related, or condense them in a compound index when the substructure is irrelevant. Moreover, when discussing the construction of a representation, the reference structure is not important, and so one may drop the structure index from the notation and write |rep⟩ instead of |A; rep⟩. Vice versa, when the representation of choice is well established - e.g. when writing expressions that describe the regression scheme after having discussed the choice of representation - one may omit the specifics of the representation and write |A⟩.
The core principle we adhere to is that the ket indicates the nature of the representation (the object being described, the body order, the symmetries that the representation obeys, the way it is constructed or its hyperparameters) while the bra indicates a discretization that might be a matter of computational convenience (e.g. it might depend on the basis on which |A⟩ is expressed), such that - at least in the limit of a complete basis - the discretization is irrelevant, and different choices are connected by a linear transformation. At the most basic level, this notation can be seen as a way to indicate more expressively the nature of the representation used, and to tidily enumerate the components of the associated feature vector. To give a concrete example, the notation used in Refs. 27,101 to indicate the components of a SOAP feature vector associated with the environment $A_i$, centered around the i-th atom of structure A, can be written as $\langle a_1 n_1; a_2 n_2; l|A; \rho_i^{\otimes 2}\rangle$, where $\rho_i^{\otimes 2}$ hints at the fact that SOAP features can be derived as a 2-point correlation of the i-centered atom density, $a_{1,2}$ indicate the chemical nature of the atoms, $n_{1,2}$ radial basis channels and $l$ an angular momentum channel. The semicolons indicate that the $(a_1 n_1)$ and $(a_2 n_2)$ indices are conceptually linked in the construction. The $l$ index is associated with both densities, as a consequence of the process that makes the SOAP power spectrum invariant to rotations. When the nature of the indices is unimportant, we use generic indices, namely ⟨Q| when substituting continuous indices, and ⟨q| when substituting one or more discrete indices.
Much like Dirac notation in quantum mechanics, however, the real value of this formalism is seen in how it helps avoid errors when manipulating representations. For example, a change in the basis that is used to practically compute a representation can be written as a linear transformation,

$\langle Q'|A; \mathrm{rep}\rangle = \int dQ\, \langle Q'|\hat{T}|Q\rangle \langle Q|A; \mathrm{rep}\rangle,$

where $\langle Q'|\hat{T}|Q\rangle$ indicates the coefficients that enact the change of basis. Note that, much as it is the case in quantum chemistry, non-orthogonal bases introduce some ambiguity in the bra-ket notation, because the coefficients in the expansion of a function in the basis differ from the scalar product between the basis and the function - the two being related by the overlap matrix of the basis. Thus, the notation we use here captures the gist of the various manipulations, but should be treated with some care when translating it into a practical implementation if the basis used is not orthonormal. Furthermore, features could be closely related without being linked to each other by a linear transformation, for instance differing by the value of a hyperparameter $r_\mathrm{cut}$. In these cases, one could indicate in the ket the parameters that distinguish the two representations, e.g. $|r_\mathrm{cut} = 5\,\mathrm{Å}\rangle$, reserving the bra for the indices associated with the expansion of the abstract representation onto a concrete basis. The scalar product between the features of two structures A and A' (as it could be used to define a linear kernel K(A, A') that expresses the similarity between the configurations) reads

$K(A, A') = \langle A|A'\rangle = \int dQ\, \langle A|Q\rangle \langle Q|A'\rangle,$

where we use the convention $\langle A|Q\rangle = \langle Q|A\rangle$ to obtain an expression that is reminiscent of a completeness relation $\int dQ\, |Q\rangle\langle Q| = 1$. A similar expression can be written to describe a linear model predicting the value of a target property E for a structure A.
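The basis-independence of the linear kernel is easy to check numerically. The sketch below uses arbitrary synthetic feature vectors and an orthogonal matrix standing in for the basis-change coefficients; all the numbers are hypothetical placeholders, not quantities from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
feat_A = rng.normal(size=16)   # <Q|A> on some discrete, orthonormal basis
feat_B = rng.normal(size=16)   # <Q|A'>

# an orthogonal matrix playing the role of the basis-change coefficients <Q'|T|Q>
T = np.linalg.qr(rng.normal(size=(16, 16)))[0]

K_original = feat_A @ feat_B                 # K(A, A') in the original basis
K_rotated = (T @ feat_A) @ (T @ feat_B)      # same kernel after the change of basis
print(np.isclose(K_original, K_rotated))     # True: the kernel is basis-independent
```

For a non-orthonormal basis the second contraction would instead require the overlap matrix, which is precisely the ambiguity discussed above.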
When using a representation |A; rep⟩ to describe structures, E can be approximated as

$E(A) \approx \int dQ\, \langle E; \mathrm{rep}|Q\rangle \langle Q|A; \mathrm{rep}\rangle,$

with a slight abuse of notation, because usually it does not make sense to see ⟨Q|E; rep⟩ as an inner product, and so ⟨Q|E; rep⟩ indicates just the regression weights for a model based on |A; rep⟩. The bra-ket notation we use implicitly assumes linearity in the transformation between different choices of basis, and in the modeling of target properties. All of the expressions discussed here as integrals over a continuous index can be formulated as sums over (finitely or infinitely) countable, discrete indices. Even though the features can be used as an input of an arbitrarily complex nonlinear regression scheme, we will often investigate their behavior in the context of linear models, because they reveal more transparently how a given representation reflects structure-property relations.
A pattern we use frequently in what follows, and that mimics a construction used in quantum mechanics, is the combination of multiple kets to build a tensor-product space, e.g. $|A; \mathrm{rep}\rangle \otimes |A'; \mathrm{rep}'\rangle$. The construction of a tensor-product representation is well-defined even without indicating explicitly the basis used to describe either side of Eq. (6), and it is often possible to use either an explicit Cartesian product of the bases on the right-hand side, or a combined basis, using only $|A\rangle$ as a special case of Eq. (6) in which $A \equiv A'$, and $\mathrm{rep} \equiv \mathrm{rep}'$ can be omitted. Finally, we can consider the action of an "operator" on a ket, that is to be interpreted as a linear map that transforms the atomic structure, or the nature of the representation. Taking for instance the operator $\hat{i}$ associated with inversion symmetry, $\hat{i}|A\rangle$ indicates the representation associated with structure A after the coordinates of all atoms have been reflected relative to the origin. By summing over the operators associated with a symmetry group, an operation which is also referred to as Haar integration 110 , one can build symmetrized representations that are invariant under the actions of the elements of the group, e.g. for the $C_i$ point group,

$\overline{|A \otimes A'\rangle}^{C_i} = |A \otimes A'\rangle + \hat{i}|A \otimes A'\rangle.$

When the resulting symmetric representation is used often, and the symmetry group is clear from the context, we indicate the averaging with an overline and omit the explicit indication of the group it has been symmetrized over, e.g., $\overline{|A \otimes A\rangle}^{C_i} \to \overline{|A \otimes A\rangle} \to |\overline{A^{\otimes 2}}\rangle$. Figure 5 illustrates the notation and the Haar integration in one dimension. Two distinct functions, f and g, are plotted using their usual real-space features, $f(x) \equiv \langle x|f\rangle$ and $g(x) \equiv \langle x|g\rangle$. Applying inversion yields $\langle x|\hat{i}|f\rangle = f(-x)$. An inversion-invariant feature can be created by symmetrizing: $|\overline{f^{\otimes 1}}\rangle = |f\rangle + \hat{i}|f\rangle$, but our choice of f and g leads to a degenerate feature, as $|\overline{f^{\otimes 1}}\rangle = |\overline{g^{\otimes 1}}\rangle$. A second-order feature may be obtained by generating the tensor product of the functions, e.g. $|f\rangle \otimes |f\rangle$, which in real space results in $\langle x_1; x_2|f \otimes f\rangle = f(x_1)f(x_2)$. Symmetrizing this tensor product yields features $|\overline{g^{\otimes 2}}\rangle$ and $|\overline{f^{\otimes 2}}\rangle$ that are also inversion-invariant, but are still able to distinguish between the two functions.

B. Global field representations
The starting point for the construction of a symmetry-adapted field representation is a field that describes the structure in terms of the distribution of its atoms - or, more generally, of points that are associated with the building blocks of the material, as one would have in a coarse-grained model. In the simplest possible case, one would take localized functions g centered on each atomic position $\mathbf{r}_i$ and define

$\langle \mathbf{x}|A; \rho\rangle = \sum_i g(\mathbf{x} - \mathbf{r}_i), \quad (9)$

where $\langle \mathbf{x}|g\rangle \equiv g(\mathbf{x})$ is a localized function (e.g. a Gaussian) and the ρ in the ket indicates the kind of field used to describe the structure. As we discuss in more detail in Section IV E, the atomic density functions can be either finite-width Gaussians, which leads to representations akin to SOAP features 27 , or Dirac δ distributions, which recovers representations similar to the moment tensor potentials 111 or the atomic cluster expansion 106 . To indicate the g → δ limit, we use the notation |ρ⟩ → |δ⟩. Atoms (or more generally, "point particles") can be further characterized by internal attributes, that could be discrete (e.g. the chemical nature of an atom, $a_i$) or continuous (e.g. a molecular dipole $\mathbf{u}_i$), e.g.

$\langle a\mathbf{x}|A; \rho\rangle = \sum_i \delta_{a a_i}\, g(\mathbf{x} - \mathbf{r}_i). \quad (10)$

In this form, Eq. (10) can be seen as an abstraction of the many real-space "voxel" representations of materials. 112,113 The ket |A; ρ⟩ defined by expressions like (9) or (10) could equally well be expressed in a different basis, e.g. expanded in plane waves,

$\langle \mathbf{k}|A; \rho\rangle = \int d\mathbf{x}\, \langle \mathbf{k}|\mathbf{x}\rangle \langle \mathbf{x}|A; \rho\rangle. \quad (11)$

Eqs. (9) and (11) contain the same amount of information. Even though the choice of a basis can be very important to simplify analytical derivations or practical implementation, representations can be regarded as abstract objects that can be defined independently of the basis set, much as it is the case for the wavefunction in quantum mechanics.
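The smooth density of Eq. (9) is straightforward to evaluate numerically. The sketch below is a minimal illustration: it sums normalized isotropic Gaussians placed on each atom; the width and positions are arbitrary choices, not values prescribed in the text.

```python
import numpy as np

def atom_density(x, positions, sigma=0.3):
    """<x|A; rho> = sum_i g(x - r_i), with normalized isotropic Gaussians g."""
    norm = (2 * np.pi * sigma**2) ** (-1.5)
    diffs = np.asarray(x) - np.asarray(positions)          # (n_atoms, 3)
    return norm * np.exp(-np.sum(diffs**2, axis=1) / (2 * sigma**2)).sum()

positions = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])   # a two-atom "structure"
on_atom = atom_density([0.0, 0.0, 0.0], positions)
midpoint = atom_density([0.5, 0.0, 0.0], positions)
print(on_atom > midpoint)  # True: for this width, the field peaks on the atoms
```

Evaluating the same function on a regular grid yields exactly the "voxel" picture mentioned above; the g → δ limit corresponds to σ → 0.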

C. Translational invariance and atom-centered features
One way to make ⟨x|ρ⟩ translationally invariant is to sum over the continuous translation group, $\int d\mathbf{t}\, \langle \mathbf{x}|\hat{t}|\rho\rangle$. As discussed in Ref. 90, summing directly over the atom density eliminates all structural information. Information loss is a usual issue with Haar integration, as exemplified in Figure 5. One can avoid or reduce it by summing over tensor products of the atom density field. Considering the case in which atoms are described only by their position and chemical identity, integrating over translations $\hat{t}$ yields a two-point density correlation function

$\langle a_1\mathbf{x}_1; a_2\mathbf{x}_2|A; \rho^{\otimes 2}\rangle \propto \sum_{ij} \delta_{a_1 a_i}\delta_{a_2 a_j}\, \tilde{g}(\mathbf{x}_2 - \mathbf{x}_1 - \mathbf{r}_{ji}), \quad (12)$

where $\tilde{g}$ indicates the cross-correlation of two of the localized density functions. We indicate that the representation has been obtained by averaging over translations the tensor product of two density fields with the superscript notation $\rho^{\otimes 2}$, and we separate with a semicolon groups of feature indices that are associated with each field, as discussed in Section IV A. Note that the representation in Eq. (12) has a large null space, as it depends only on $\mathbf{x}_1 - \mathbf{x}_2$. One could then re-define it by labelling features using a single position vector, or transform it in a plane wave basis:

$\langle a_1; a_2; \mathbf{k}|\rho^{\otimes 2}\rangle = \int d\mathbf{x}\, \langle a_1 \mathbf{0}; a_2 \mathbf{x}|\rho^{\otimes 2}\rangle \langle \mathbf{k}|\mathbf{x}\rangle = \langle a_1\mathbf{k}|\rho\rangle^* \langle a_2\mathbf{k}|\rho\rangle, \quad (13)$

where $\langle \mathbf{k}|\mathbf{x}\rangle \equiv e^{-i\mathbf{k}\cdot\mathbf{x}}$, and where the second equality is a consequence of the convolution theorem. One sees that the translationally-symmetrized density is equivalent to the diffraction pattern of the atomic structure, that has been used as a descriptor to classify crystalline configurations. 25 This construction can be taken as an inspiration to introduce an atom-centered representation

$\langle a\mathbf{x}|\rho_i\rangle = \sum_j \delta_{a a_j}\, g(\mathbf{x} - \mathbf{r}_{ji}), \quad (14)$

where $\mathbf{r}_{ji} = \mathbf{r}_j - \mathbf{r}_i$. The fact it is atom-centered (and hence translationally invariant) is hinted at by the subscript notation $\rho_i$, and so in what follows we only use this subscript to distinguish it from its non-symmetrized counterpart (9) and simultaneously to indicate the central-atom index.
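The identification of the translationally-symmetrized density with a diffraction pattern can be checked numerically: the modulus squared of the Fourier coefficients of a δ-like atom density is unchanged when all atoms are rigidly translated, since the translation only contributes an overall phase. A minimal sketch, with random positions and an arbitrary wavevector chosen purely for illustration:

```python
import numpy as np

def density_ft(k, positions):
    """<k|A; delta> = sum_i exp(-i k . r_i) for a delta-like atom density."""
    return np.exp(-1j * positions @ k).sum()

rng = np.random.default_rng(1)
pos = rng.normal(size=(5, 3))
k = np.array([1.0, -2.0, 0.5])
t = np.array([0.3, 0.7, -1.1])                     # rigid translation

I_before = abs(density_ft(k, pos)) ** 2            # "diffraction intensity"
I_after = abs(density_ft(k, pos + t)) ** 2
print(np.isclose(I_before, I_after))  # True: translation only changes a phase
```

The phase that is discarded here is exactly the information about the absolute position of the structure, which is what the symmetrization is meant to remove.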
When expressing a representation centered around atom i without emphasis on its precise nature, we will use the notation $|A_i\rangle$. Writing the symmetrized two-point density correlation in terms of Eq. (14),

$\langle a_1 \mathbf{0}; a_2 \mathbf{x}|A; \rho^{\otimes 2}\rangle \propto \sum_{i\in A} \delta_{a_1 a_i} \langle a_2\mathbf{x}|\rho_i\rangle, \quad (15)$

clarifies how an atom-centered representation is a natural consequence of the translational symmetrization. When building a linear model, this expression implies an additive decomposition of the target property, as well as the use of separate models depending on the nature of the central atomic species:

$E(A) \approx \sum_{i\in A} E_{a_i}(A_i). \quad (16)$

As discussed in Section V A, this expression can be taken as the prototype of all pair potentials, and higher orders of many-body interactions can be incorporated by taking higher tensor powers before symmetrization, or in the subsequent step of rotational averaging. Localization can be enforced by introducing a cutoff function in the definition (14). This is far from being an inconsequential operation, as it introduces an error (atomic energies and properties cannot depend on neighbors farther than this limit) but often results in more robust models, which perform better in the data-poor regime. We discuss this in more detail in Section VIII C.

FIG. 6. Copies of the atom-centered density are evaluated at ν separate points, and the tensor product is averaged by simultaneously rotating all densities.
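Localization is in practice enforced with a smooth switching function, so that features (and the forces derived from them) vary continuously as a neighbor crosses the boundary. The shifted cosine below is one common, hypothetical choice, not a form prescribed in the text:

```python
import numpy as np

def f_cut(r, r_cut=5.0):
    """Smooth cosine cutoff: 1 at r=0, 0 for r >= r_cut, with a continuous
    derivative at the boundary, so contributions vanish smoothly with distance."""
    r = np.asarray(r, dtype=float)
    x = np.minimum(r, r_cut)
    return np.where(r < r_cut, 0.5 * (1.0 + np.cos(np.pi * x / r_cut)), 0.0)

print(f_cut(0.0))   # 1.0
print(f_cut(5.0))   # 0.0  (a neighbor at or beyond r_cut cannot contribute)
```

In a density-based construction, each neighbor's localized function in Eq. (14) would simply be multiplied by $f_\mathrm{cut}(r_{ji})$.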

D. Rotational invariance and body-ordered representations
The atom-centered representation (14) is translationally invariant, but does depend on the orientation of the structure. One should then proceed to perform Haar integration over the rotation group and (possibly) over inversion.
We can define the (ν + 1)-body-order symmetrized field representation as

$|\overline{\rho_i^{\otimes\nu}}\rangle = \int d\hat{R}\, \left(\hat{R}|\rho_i\rangle\right)^{\otimes\nu}. \quad (17)$

This can be expanded on an explicit position basis (Fig. 6), making it clear how $|\overline{\rho_i^{\otimes\nu}}\rangle$ corresponds to a symmetrized, ν-point correlation of the atom density centered on the i-th atom - a (ν + 1)-point correlation function, in the language used in statistical mechanics to describe the structure of liquids 114 . Similar to the case of Eq. (12), this object has a large null space (e.g. in the ν = 1 case it only depends on $x_1 = |\mathbf{x}_1|$). As discussed in Ref. 90, one can choose a more concise enumeration of the real-space correlations in terms of distances and angles, that reduces, in the g → δ limit of δ-like atom densities, to a sum over distances and angles between atoms. For instance, for the ν = 2 case one can write

$\langle a_1 r_1; a_2 r_2; \omega|\overline{\delta_i^{\otimes 2}}\rangle \propto \sum_{jj'} \delta_{a_1 a_j}\delta_{a_2 a_{j'}}\, \delta(r_1 - r_{ji})\,\delta(r_2 - r_{j'i})\,\delta(\omega - \hat{\mathbf{r}}_{ji}\cdot\hat{\mathbf{r}}_{j'i}), \quad (19)$

where we use ρ → δ to indicate that the correlation function is built on the g → δ limit of the atom density field. Expressions of this kind reveal the close connection between symmetrized-field representations and atom-centered symmetry functions 10,19,115 , as well as equivalent constructions such as that used in DeepMD 116 and the FCHL features 97,98 . Features that describe a chemical environment are written as a sum over tuples of neighbors of appropriate functions of their distances and angles, and can be seen as just a different choice of basis set for Eq. (19), with generic components $\langle a_1 a_2 k|\overline{\delta_i^{\otimes 2}}\rangle$. (20) Note that we choose to symmetrize the atom-centered description $|A; \rho_i\rangle$ - given that this is the procedure that recovers most of the existing representations - but one could as well proceed by averaging over tensor products of the translationally invariant representation of the full structure, $|A; \rho^{\otimes 2}\rangle$, as it was done for instance in Ref. 117.
Doing so results in the appearance of cross terms involving correlations between densities centered on different atoms, which could be used to systematically incorporate in this framework machine-learning approaches based on convolutional, and message-passing, neural networks that combine information centered on neighboring atoms. 20,118

E. Density correlations in an angular momentum basis

More concise (and easier to evaluate) expressions for the density correlation representations can be obtained with a change of basis. Using orthonormal radial functions $R_n(x) \equiv \langle x|n\rangle$ and spherical harmonics $Y_l^m(\hat{\mathbf{x}}) \equiv \langle \hat{\mathbf{x}}|lm\rangle$ yields a discrete set of coefficients that transform as spherical harmonics,

$\langle anlm|A; \rho_i\rangle = \sum_j \delta_{a a_j} \langle nlm; \mathbf{r}_{ji}|g\rangle, \quad (22)$

where $\langle nlm; \mathbf{r}_{ji}|g\rangle$ are functions of $\mathbf{r}_{ji}$, enumerated by the indices (n, l, m), that can be evaluated numerically or analytically, depending on the choice of basis (see Section VIII D for a few examples).
The use of spherical harmonics |lm⟩ for the angular basis makes it easy to evaluate the rotational integral of Eq. (17) analytically, because the matrix elements $\langle lm|\hat{R}|l'm'\rangle = \delta_{ll'} D^l_{m'm}(\hat{R})$ correspond to Wigner D-matrices, an irreducible representation of SO(3). Well-known results from the theory of angular momentum, 119 such as the orthonormality and the product reduction formula for Wigner D-matrices, allow deriving explicit expressions for the symmetrized field representations of order ν = 1, 2, 3, where $\langle l_1 m_1; l_2 m_2|LM\rangle$ is a Clebsch-Gordan coefficient. Much as it was the case for the real-space versions of the density correlation representations, there are several redundant indices in these expressions, resulting from the rotational averaging that leaves the $m_i$ as free parameters. We can then re-label the invariant features, in a way that emphasizes the connection to existing representations, by coupling the angular basis and absorbing some of the inconsequential constant factors. For the case ν = 1 one can define $\langle an|\overline{\rho_i^{\otimes 1}}\rangle \propto \langle an00|\rho_i\rangle$, (26) which corresponds to a discretized version of a pair correlation function $g_a(r)$. For the ν = 2 case, Eq. (24) can be redefined as

$\langle a_1 n_1; a_2 n_2; l|\overline{\rho_i^{\otimes 2}}\rangle \propto \frac{1}{\sqrt{2l+1}} \sum_m \langle a_1 n_1 l m|\rho_i\rangle^* \langle a_2 n_2 l m|\rho_i\rangle. \quad (28)$

This corresponds - modulo irrelevant constants - to the rotation-invariant 3D shape descriptor 120 and to the SOAP features, which would be written, in the notation of Refs. 27,101, as $p^i_{a_1 n_1 a_2 n_2 l} \propto \sum_m \left(c^{a_1}_{n_1 l m}\right)^* c^{a_2}_{n_2 l m}$, where $c^a_{nlm} = \langle anlm|\rho_i\rangle$ indicate the density expansion coefficients following the same notation. The ν = 2 representation can be written on a real-space basis as $\langle a_1 r_1; a_2 r_2; \omega|\overline{\rho_i^{\otimes 2}}\rangle$, indicating explicitly its nature as a three-body density correlation function that depends on two distances $r_1$, $r_2$ and the cosine ω of the angle between the directions along which they are evaluated. The 4-body-order invariant representation can be constructed analogously, and is closely related to the bispectrum used in the spectral neighbor analysis method 121,122 , which is essentially equivalent to a different choice of basis. As discussed in more detail in Ref.
123 and in the next sections, the relationship between the redundant expressions (23,24,25) that arise from the integral over rotations, and the more concise versions (26,28,30), can be seen as a transformation from the uncoupled to the coupled angular momentum basis; starting from ν = 4, additional indices $k_\nu$ must be included to account for the different ways the coupling can be realized. Taking the g → δ limit of ρ yields the atomic cluster expansion (ACE), with $\langle nlm|\delta_i\rangle$ being equivalent to the $A_{inlm}$ coefficients of Ref. 106, and the B basis functions of ACE corresponding to groups of coupled features indexed by $k_1, \ldots, k_\nu$. Through a further linear transformation (change of basis), made explicit in Refs. 107,124, the moment tensor potential (MTP) of Ref. 111 can also be related to this construction. However, the philosophy behind the density correlation features is different from that behind MTPs and ACE, in that these methods were at least originally thought of as bases for polynomial regression. While these basis functions can equally be used as symmetry-adapted features, there are subtleties to be considered that we discuss in Sec. VI and in Sec. VIII B. Note that even though the contracted basis $\langle (a_i n_i l_i k_i)_{i=1\ldots\nu}|$ eliminates some of the redundant indices that are present in the tensor-product basis, the indices do not label a set of linearly independent features. Symmetries and selection rules - some of which, listed in Ref. 123, can be derived from results of angular momentum theory 125 - greatly restrict the number of independent entries that need to be computed. However, the non-trivial interaction between the radial and angular basis components makes this list incomplete. A mixed algebraic/numerical precomputation step can further reduce the required features 107 . Finally, the global SOAP-like descriptors introduced in Ref. 117, corresponding to Eq. (21), can be readily expressed in an angular momentum basis, where we recall that $\langle anlm|A; \rho\rangle = \sum_{i\in A} \langle anlm|\rho_i\rangle$.
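The rotational invariance of the ν = 2 (power-spectrum) contraction can be verified numerically without any special-function library: up to a constant factor, the l = 2 angular invariant $\sum_m |c_{2m}|^2$ equals the squared Frobenius norm of the traceless second-moment tensor of the neighbor directions (here the neighbors are projected on the unit sphere, i.e. the radial basis is suppressed; positions and rotation are arbitrary test values).

```python
import numpy as np

def p2_invariant(points):
    """Angular l=2 invariant: up to a constant, sum_m |c_2m|^2 equals the
    squared Frobenius norm of the traceless second moment of the directions."""
    u = points / np.linalg.norm(points, axis=1, keepdims=True)
    Q = sum(np.outer(v, v) - np.eye(3) / 3.0 for v in u)
    return np.sum(Q**2)

rng = np.random.default_rng(2)
neigh = rng.normal(size=(6, 3))

# a random proper rotation via QR (fix the sign of the determinant)
R = np.linalg.qr(rng.normal(size=(3, 3)))[0]
R *= np.sign(np.linalg.det(R))

print(np.isclose(p2_invariant(neigh), p2_invariant(neigh @ R.T)))  # True
```

Rotating the environment maps the moment tensor Q to R Q Rᵀ, which leaves its Frobenius norm - and hence the invariant feature - unchanged.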

F. The density trick
A crucial point in comparing different representations is that, with an appropriate discretization of the angular basis, one can evaluate symmetrized high-order correlations as sums of products of the density coefficients defined in Eq. (22). This ensures that the cost of computing all coefficients of a given order ν scales only linearly with the number of neighbors included within the cutoff around atom i, even though it scales exponentially with ν in terms of the number of basis functions, at least with a naive choice of basis. This is to be contrasted with atom-centered symmetry functions (ACSF), 19,115,116 and permutation invariant polynomials (PIP), 11 in which functions are evaluated over all possible tuples composed of ν neighbors of the central atom (or over all the possible tuples in a structure, to yield a global descriptor). In these frameworks, the cost depends linearly on the number of basis functions, but exponentially on ν in terms of the number of neighbors. This crucial difference makes density-expansion frameworks more convenient when one wants to ramp up the value of ν, and there are many neighbors. A priori sparsification schemes, exemplified in Eq. (95), and feature selection schemes, discussed in Section VIII B, allow one to keep only the most important basis functions, and can eliminate the exponential scaling with ν altogether.
Despite this rather fundamental difference in philosophy and computational cost, the two families of representations compute entities that are essentially equivalent, which we see by writing explicitly Eq. (28) in the g → δ limit as a sum over neighbors j and j':

$\langle a_1 n_1; a_2 n_2; l|\overline{\delta_i^{\otimes 2}}\rangle \propto \sum_{jj'} \delta_{a_1 a_j}\delta_{a_2 a_{j'}}\, R_{n_1}(r_{ji}) R_{n_2}(r_{j'i}) \sum_m Y_l^m(\hat{\mathbf{r}}_{ji})^* Y_l^m(\hat{\mathbf{r}}_{j'i}). \quad (33)$

By using the addition formula of the spherical harmonics we get the equivalent formulation

$\langle a_1 n_1; a_2 n_2; l|\overline{\delta_i^{\otimes 2}}\rangle \propto \sum_{jj'} \delta_{a_1 a_j}\delta_{a_2 a_{j'}}\, R_{n_1}(r_{ji}) R_{n_2}(r_{j'i})\, P_l(\hat{\mathbf{r}}_{ji}\cdot\hat{\mathbf{r}}_{j'i}), \quad (34)$

in which $P_l(\omega)$ is a Legendre polynomial of order l. In Eq. (34), the ν = 2 density correlation coefficients are computed as a function of the distances and angles between triplets of atoms including the central atom i. By plugging this expression for $\langle a_1 n_1; a_2 n_2; l|\overline{\delta_i^{\otimes 2}}\rangle$ into Eq. (20), that evaluates the value of an arbitrary atom-centered symmetry function, one sees that this result is not specific to the choice of $P_l$ as angular functions: in the limit of a complete basis set, it is equally possible to compute any ACSF using a sum over neighbor tuples or a contraction of density coefficients, drawing an explicit link between the SOAP power spectrum features, 27,101 Behler-Parrinello symmetry functions, 19,126 the DeepMD framework, 116 and FCHL features 97 . Similar expressions could be derived for higher-order atom-centered symmetry functions, showing the complete equivalence - but dramatically different computational scaling with the number of neighbors - of the two frameworks.
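The "density trick" can be demonstrated directly for the angular parts of the l = 1 and l = 2 channels, where the Legendre polynomials are simple enough to write by hand (constants and the radial basis are dropped, so the equality below mirrors the addition theorem up to a fixed factor). The O(N) route accumulates moments of the neighbor directions once and then contracts them; the O(N²) route sums Legendre polynomials over all neighbor pairs; the two agree exactly:

```python
import numpy as np

rng = np.random.default_rng(3)
neigh = rng.normal(size=(7, 3))
u = neigh / np.linalg.norm(neigh, axis=1, keepdims=True)
N = len(u)

# O(N^2) route: explicit sum over neighbor pairs of Legendre polynomials
cosw = u @ u.T
pair_l1 = cosw.sum()                              # sum_jj' P_1(cos w)
pair_l2 = ((3.0 * cosw**2 - 1.0) / 2.0).sum()     # sum_jj' P_2(cos w)

# O(N) route ("density trick"): accumulate moments once, then contract
d = u.sum(axis=0)                                 # first moment,  ~ c_{1m}
M = u.T @ u                                       # second moment, ~ c_{2m}
density_l1 = d @ d
density_l2 = (3.0 * np.sum(M**2) - N**2) / 2.0

print(np.isclose(pair_l1, density_l1))  # True
print(np.isclose(pair_l2, density_l2))  # True
```

With many neighbors, only the moments d and M (whose size is fixed by the basis, not by N) need to be stored, which is exactly the scaling advantage discussed above.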

G. Equivariant representations and tensorial features
The previous construction is suitable to represent any rotationally-invariant atomic property. In many circumstances, however, one is interested in representing vector-valued or general tensorial quantities F. In this case, the prescribed transformations that the tensor undergoes under the symmetry operations of the O(3) group (e.g. $\mathbf{F}(\hat{R}A) = \hat{R}\mathbf{F}(A)$) have to be incorporated into the atomic representation in the form of equivariant, rather than simply invariant, features, so that the representation follows the same transformation as the target property, $|\hat{R}A\rangle = \hat{R}|A\rangle$. Equivariance can be enforced by comparing environments and defining the local contribution to the target relative to a pre-defined local reference frame, which has been used to build machine-learning models of tensorial properties in molecular systems [127][128][129][130] .

FIG. 7. Graphical representation of an SO(3)-equivariant tensor-product representation. Copies of the atom-centered density are evaluated at ν separate points, together with a set of spherical harmonics that provide a basis to expand the components of a tensorial property. The tensor product is averaged by simultaneously rotating all densities and the |λµ⟩ term.

A more general approach for achieving this goal consists in endowing the representation with the symmetries of spherical harmonics, $\langle \hat{\mathbf{x}}|\lambda\mu\rangle = Y_\lambda^\mu(\hat{\mathbf{x}})$, as well as the
desired parity under the action of the inversion operator $\hat{i}$, which we associate with a ket |σ⟩ such that $\hat{i}|\sigma\rangle = \sigma|\sigma\rangle$. The eigenvalue σ is 1 for polar tensors, and −1 for pseudotensors. Features that transform as |σ⟩ ⊗ |λµ⟩ can be achieved by including two additional fields 90,131 within the symmetrized tensor product of Eq. (17). The operation is depicted in Fig. 7, showing how the λµ ket corresponds to the evaluation of a set of spherical harmonics that anchors the atom-centered density to a reference frame. The scalar and rotationally-invariant case is recovered by taking |σ; λµ⟩ = |1; 00⟩. This construction represents a particularly convenient framework to target the prediction of any Cartesian tensor F in terms of its irreducible spherical components, 132 namely $F^{\sigma\lambda}_\mu$, that transform under rotation and inversion as spherical harmonics of the corresponding order λ and parity σ. Within a linear regression model, they can be written as the combination of equivariant representations of the proper order λ and parity σ with a set of rotationally-invariant weights ⟨Q|F; σ; λ⟩:

$F^{\sigma\lambda}_\mu(A_i) \approx \int dQ\, \langle F; \sigma; \lambda|Q\rangle \langle Q|A; \overline{\rho_i^{\otimes\nu}; \sigma; \lambda\mu}\rangle. \quad (37)$

Each irreducible spherical component of F gives rise to a separate equivariant model, and the appropriate transformation rules are ensured by the fact that each equivariant feature $\langle Q|A; \overline{\rho_i^{\otimes\nu}; \sigma; \lambda\mu}\rangle$ separately transforms as the spherical harmonics |λµ⟩. Much like the case of invariant symmetrized-field features, Eq. (35) can be most effectively computed by first expanding the atom-centered field on a basis of spherical harmonics. An equivariant extension of the atomic cluster expansion 124 or of the moment tensor potentials can be recovered by taking the g → δ limit.
A concrete example of these features is given by the density coefficients themselves: in fact, one can see that the ν = 1 equivariant reads simply $\langle an|\overline{\rho_i^{\otimes 1}; \sigma; \lambda\mu}\rangle \propto \langle an\lambda\mu|\rho_i\rangle$. Note how in the bra-ket notation the (λ, µ) indices on the two sides of this equation carry a different meaning. When used in the bra of the density expansion $\langle n\lambda\mu|\rho_i\rangle$, they identify one of many rotationally variant components; there is no explicit link to their symmetry properties, and one could build a rotationally variant model by selecting only some of the µ values for a given (n, λ). When used in the ket of an equivariant feature $\langle n|\overline{\rho_i^{\otimes 1}; \sigma; \lambda\mu}\rangle$, they label groups of features that should be taken together, because they transform in a specific way under the symmetries of the O(3) group. It is easy to see that by using $\langle n|\overline{\rho_i^{\otimes 1}; \sigma; \lambda\mu}\rangle$ features in Eq. (37) one obtains a model that fulfills (36), because acting on the ket with $\hat{R}$ yields a product with the associated Wigner matrix - with the caveat that pseudotensors cannot be described by ν = 1 features. Scalar products of these equivariant features generate kernels that are suitable for symmetry-adapted Gaussian process regression 133,134 , e.g. λ-SOAP kernels, which generalize the covariant kernels used in Ref. 135 to learn interatomic forces. The fact that equivariant features of the form (35) follow O(3) transformation rules means that they can be combined using established relationships in the quantum theory of angular momentum. In particular, the coupled-basis representation used in the definition of Eqs. (26)(27)(28)(29)(30) can be formulated for an arbitrary value of ν, and in this form it is possible to express succinctly 123 a recursive formula to evaluate $|\overline{\rho_i^{\otimes\nu}; \sigma; \lambda\mu}\rangle$ based on lower-order terms.
For ν = 2, the recursion yields the original expression for the λ-SOAP equivariants 133 . Similar recursive expressions have been independently proposed to efficiently compute invariant features 107,111 , which can be obtained by taking |σ; λµ⟩ = |1; 00⟩ in Eq. (40). The possibility of combining equivariant features using angular momentum rules is also exploited in the construction of covariant neural networks 118,136 . It is also worth mentioning that it is possible to build models that are imbued with the appropriate transformation properties in an indirect fashion, by learning atom-centered scalars and combining them with the atomic positions to evaluate formal (or actual) molecular multipoles. This is easily seen for the case of the dipole moment of a neutral molecule, that can be computed as

$\boldsymbol{\mu}(A) = \sum_{i\in A} q(A_i)\, \mathbf{r}_i. \quad (42)$

Models of this form have been used since the early days of the construction of molecular potential and dipole moment surfaces 54,138 , combined with neural-network potentials to compute IR spectra in the condensed phases 139 , and more recently combined with tensorial models, to describe the interplay of atomic charges and polarization contributing to the total dipole moment 140 . Assigning constant formal charges $q_i$ to atoms has also been used to derive covariant kernels, in the so-called operator machine learning framework 141 , which is also similar in spirit to the tensorial embedded atom neural network 142 . The gist of the idea (although expressed in a feature rather than kernel (or NN) language) is that one can define a translationally-invariant representation that depends formally on an applied electric field E.
Deriving with respect to one of the components of E brings a dependency on the corresponding component of $\mathbf{r}_{ji}$, that upon rotational averaging (keeping in mind that $\hat{R}$ acts on atomic coordinates and not on the external field) plays the same role as |λµ⟩ in Eq. (35), providing a basis of features that can be used to learn vectors covariantly. The use of local interatomic vectors to build a covariant reference system is similar to the approach adopted in Ref. 143 to define a general atomic neighborhood fingerprint, and in Ref. 144 to machine-learn the position of electronic Wannier centers in condensed phases. Despite the superficial similarity with the environment-dependent point-charge model of Eq. (42), this scheme is more closely related to a framework based on atomic dipoles, since its predictions can be decomposed as a sum of atom-centered equivariant terms.
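The equivariance of the point-charge construction of Eq. (42) is easy to verify numerically: since the prediction is linear in the positions and the (invariant, scalar) charges do not change under rotation, the dipole rotates rigidly with the structure. A minimal sketch with fixed placeholder charges standing in for the learned scalars:

```python
import numpy as np

def dipole_from_charges(positions, charges):
    """mu = sum_i q_i r_i : an equivariant observable assembled from scalars."""
    return np.einsum("i,ij->j", charges, positions)

pos = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
q = np.array([-0.5, 0.5])               # neutral "molecule": charges sum to zero
mu = dipole_from_charges(pos, q)

# rotating the structure rotates the prediction: mu(R A) = R mu(A)
R = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 1.0]])        # 90-degree rotation about z
print(np.allclose(dipole_from_charges(pos @ R.T, q), R @ mu))  # True
```

Neutrality (Σᵢ qᵢ = 0) additionally makes the prediction independent of the choice of origin, which is why the construction is stated for a neutral molecule.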

H. Long-range features
Introducing a cutoff in the definition of the local density is not only necessary to reduce the cost of evaluating the expansion coefficients, or the number of terms that have to be included to obtain a converged expansion of the density correlations. Increasing the range of the environment makes the model more complex, which often results in slower learning when limited training data is available. 28 One pragmatic solution is to build models that explicitly incorporate a physically-motivated functional form as a baseline, which could take the form of an existing model 147,148 , an electrostatic scheme based on machine-learned partial charges 139,149,150 or atomic multipoles 128,151 . A representation that aims to describe, in a data-driven manner, properties that depend on multiple length scales, should be similarly multi-scale in nature. This idea has been implemented by combining local representations with different cutoffs 28 , scaling atomic contributions according to distance 97,105 (as we discuss in more detail in Section VIII C), treating separately intra- and inter-molecular correlations 152,153 , as well as by building global structural representations based on an intrinsically multi-scale wavelet scattering transform 154 . An alternative approach, that provides a good example of how the symmetrized atomic field construction can be extended beyond the use of an atomic density as the starting point, defines a Coulomb-like potential field based on the smoothed atomic density (Figure 8),

$\langle \mathbf{x}|A; V\rangle = \int d\mathbf{x}'\, \frac{\langle \mathbf{x}'|A; \rho\rangle}{|\mathbf{x} - \mathbf{x}'|}.$

Symmetrizing this field in the same way as for |ρ⟩ leads to an atom-centered potential $|V_i\rangle$, where we introduce the short-range density $|\rho^<_i\rangle$, restricted to the region within the cutoff, and the far-field density $|\rho^>_i\rangle$, restricted outside the cutoff, and the corresponding local and non-local fields $|V^<_i\rangle$ and $|V^>_i\rangle$.
Crucially, these features incorporate information on atoms outside the cutoff, yet their complexity can be kept under control by restricting the range of the spherical environment over which they are computed. Just as for |ρ_i⟩, the ket can be discretized by expanding it on an orthogonal basis of radial functions and spherical harmonics to obtain ⟨anlm|V_i⟩. One can then build features that are fully equivariant by averaging |V_i⟩ over the symmetry operations of the O(3) group, leading to ν-point correlations analogous to those discussed above. Furthermore, one can combine local and long-range fields, as in Figure 9, constructing a family of multi-scale long-distance equivariant (LODE) features 137 . The simplest multi-scale representation |ρ_i ⊗ V_i⟩ can be linked to physics-based models using an atom-centered multipole expansion of electrostatic interactions, as we discuss further in Section V B, but is also effective for learning a multitude of long-ranged interactions, from permanent electrostatics to polarization and dispersion. Figure 10 shows that, when trying to represent long-range interactions between molecular fragments, a model based on local |ρ^⊗2_i⟩ features produces completely unphysical behavior, with the interaction reaching a plateau when the molecules are separated by more than the cutoff distance. Multi-scale LODE features, instead, can describe the asymptotic tail even when using a 3 Å cutoff in the definition of the atom-centered environments, and are capable of representing interactions of very different chemical nature. Using a non-local field as the starting point of the symmetrization procedure provides interesting opportunities to incorporate long-range, many-body interactions in atomistic machine learning.

V. REPRESENTATIONS AND MODELS
Even though this review focuses on the problem of representing atomic structures in terms of a vector of features, one cannot ignore the intimate connection between the choice of features and how they are used to construct representations of symmetric properties, such as site energies, which are then used in the context of regression schemes. 10,20,70,78,106,111,118,133,137,[155][156][157] The purpose of this section is therefore to discuss the interplay between representations and models. Given a set of symmetric features ⟨q|A_i⟩ of an atomic environment A_i, we explore how to use it to represent a symmetric property F(A_i). We discuss linear approximations, and show that the family of features we introduced in Section IV leads to natural generalisations of well-established models of interactions between atoms and molecules in terms of a body-ordered expansion. These relatively simple models put stringent requirements on the quality of the feature sets. We then go on to review how highly non-linear models may provide more flexibility in describing the relationship between a structure and its properties, and yield satisfactory results even with a rather simple, imperfect choice of features. Here, and in the following, we always understand implicitly that equality in these approximations can only be attained in the limit of an infinite cutoff radius and a suitably converged parameterisation.

A. Linear models and body-order expansion
An advantage of linear models is that they can often be connected to classical physics-inspired frameworks, and bring to light physical-chemical insights. An example of this connection involves the construction of interatomic potentials in terms of a body-ordered hierarchy of atom-centered energy terms, in which each term can be written as a sum over ν neighbors of the central atom. This kind of expansion underlies the vast majority of empirical force fields, which are customarily written as a combination of pair potentials and short-range 2-, 3-, and 4-body bonded terms. Most potentials truncate this expansion at body-order three, i.e. ν = 2 (a notable exception being the dihedral-angle potentials used in force fields, which are four-body but involve selected groups of atoms rather than a sum over all possible triplets). This is because the cost of a naive evaluation of the sum over j_1 < ··· < j_ν scales exponentially with the body order ν, i.e. as O(N_i^ν) for an environment containing N_i atoms. More sophisticated ways of symmetrizing the body-ordered terms, such as those discussed in Refs. 158 and 78, alleviate this behavior. In the following paragraphs we demonstrate, in particular, how this exponential scaling can be overcome by using the density correlation representations discussed in Section IV.
The three-body case. It is illuminating to first discuss in full detail the representation of a 3-body site potential, written traditionally in internal coordinates in the form

E_i^(3) = Σ_{j<j'} u^(3)(r_ji, r_j'i, ω_ijj'),    (50)

where ω_ijj' := r̂_ji · r̂_j'i. In order to connect to the atomic density correlations we first rewrite this, adding and subtracting a self-interaction from the 3-body term, as

E_i^(3) = −(1/2) Σ_j u^(3)(r_ji, r_ji, 1) + (1/2) Σ_{j,j'} u^(3)(r_ji, r_j'i, ω_ijj').    (51)
Approximating the resulting pair term u^(2)(r) in terms of a radial basis R_n(r) = ⟨r|n⟩ yields an expression that is linear in the coefficients ⟨n|δ_i⟩, where |δ_i⟩ is the g → δ limit of the atom-centered density |ρ_i⟩. For the three-body term we revisit (20): first, we approximate u^(3) in terms of the radial basis R_n(r) = ⟨r|n⟩ and the Legendre polynomials P_l(ω) = ⟨ω|l⟩,

u^(3)(r, r', ω) ≈ Σ_{n n' l} u^(3)_{n n' l} R_n(r) R_{n'}(r') P_l(ω).

Applying Legendre's addition theorem to expand the P_l in terms of spherical harmonics Y_l^m(r̂) = ⟨r̂|lm⟩, absorbing the factor 4π/(2l+1) into the weights u^(3)_{n n' l}, and reordering the summation yields an expression in terms of products of density expansion coefficients. Finally, we sum over all pairs (j, j') and reorder the summation to arrive at a form that is linear in the invariant 2-correlations. In summary, we have written an arbitrary 3-body site potential in terms of 1- and 2-correlations of the atomic density. Aside from connecting classical body-ordered interatomic potentials and ν-correlations of the atomic density, this formulation has significant advantages in terms of computational complexity, which we discuss below after generalising the argument to arbitrary body order.
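The mechanics of this rewriting, the "density trick" that turns the O(N_i²) double sum over neighbor pairs into products of single sums over neighbors, can be sketched numerically. The snippet below is a minimal illustration in 2D, using circular harmonics e^{imθ} in place of spherical harmonics and a toy radial basis (both our own simplifying assumptions, not the formulas of the text):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50                                # number of neighbors in the environment
r = rng.uniform(0.5, 3.0, N)          # radial distances r_ji
theta = rng.uniform(0, 2*np.pi, N)    # polar angles of the neighbors

def radial(n, r):
    # toy radial basis R_n(r); any linearly independent set works here
    return r**n * np.exp(-r)

def naive_3body(n1, n2, m):
    # O(N^2) double sum over neighbor pairs; cos(m*omega) plays the role of P_l
    total = 0.0
    for j in range(N):
        for jp in range(N):
            total += (radial(n1, r[j]) * radial(n2, r[jp])
                      * np.cos(m * (theta[j] - theta[jp])))
    return total

def density_trick(n1, n2, m):
    # O(N): expand the density once, then take products of the coefficients
    c1 = np.sum(radial(n1, r) * np.exp(1j*m*theta))
    c2 = np.sum(radial(n2, r) * np.exp(1j*m*theta))
    return (c1 * np.conj(c2)).real

v_naive = naive_3body(2, 3, 4)
v_trick = density_trick(2, 3, 4)

# rotation invariance: rigidly rotating the environment leaves the feature unchanged,
# because the phases e^{i m delta} cancel in the product c1 * conj(c2)
theta_rot = theta + 0.7
c1 = np.sum(radial(2, r) * np.exp(1j*4*theta_rot))
c2 = np.sum(radial(3, r) * np.exp(1j*4*theta_rot))
v_rot = (c1 * np.conj(c2)).real
```

The naive and factorized evaluations agree to machine precision, and the feature is unchanged by a rigid rotation of all neighbors, mirroring the role of the SO(3) average in three dimensions.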
General (ν + 1)-body order potentials. The systematic expansion to arbitrary body orders has been applied to the description of alloys in terms of a cluster expansion, a procedure that was very early shown to provide a complete description of the problem 159 , to the rationalization of fragment-based electronic structure methods 160 , and to the construction of last-generation potentials for water and aqueous systems 147 .
We adopt the generalisation of (51) that includes self-interaction, which can be obtained from the more natural formulation (49) by correcting the self-interaction terms into the ν-body-order energy similarly to Eq. (51).
To connect (57) to the density correlations we represent the rotationally invariant (ν + 1)-body function u^(ν+1) in a symmetrized tensor-product basis, where we use Q as a shorthand for (x_1; . . . ; x_ν), so that ⟨Q|r_{j_1 i}, . . . , r_{j_ν i}⟩ ≡ Π_{k=1}^ν δ(x_k − r_{j_k i}). The rotation can be made to act on the atomic positions or on the basis, depending on convenience. The (ν + 1)-order site energy is obtained by summing over clusters of neighbors. The symmetrized sum can be reordered to show that it corresponds to the ν-point density correlation, which is precisely Eq. (18) written in the g → δ limit. Thus we have explicitly represented E^(ν+1) in terms of the symmetry-adapted density correlations. We emphasize again that this calculation required the inclusion of the self-interactions as the starting point (57).
For a practical implementation we can choose a finite, discrete basis, approximating E^(ν+1) as a linear combination of ν-order density correlation features. Any complete implementation of ν-order density correlation features 106,107,111,123 provides a basis to expand u^(ν+1) and approximate a (ν + 1)-order term that contributes to the body-ordered expansion of E(A). The foregoing discussion shows that these bases are complete in the following sense. An (infinite) collection of symmetrized features {⟨q|A_i⟩}_{q∈q_total} is a complete linear basis if there exists a sequence of finite subsets q ⊂ q_total such that the corresponding approximants F_q converge to F, i.e. F_q approximates F to within arbitrary accuracy in the limit as the number of features tends to infinity. We stress here that the weights ⟨F|q⟩ depend on the entire choice of feature set q and not just on the single index q. Therefore the density correlation features provide a universal, complete linear basis to approximate body-ordered potentials and, more generally, body-ordered expansions of properties that can be meaningfully written as a sum of atom-centered contributions.
For a specific choice of the discretized basis, Eq. (61) is the ACE model 106,107 . Note that the symmetrized correlations ⟨q|δ^⊗ν_i⟩ can be efficiently and conveniently evaluated, as already hinted at in Section IV E. Since MTPs provide an alternative basis set for the same space, they are complete as well, and in the same sense. We also emphasize that a rigorous proof of completeness of MTPs was already given by Shapeev 111 . The "density trick", i.e., expanding in terms of the density correlations, ensures linear scaling in terms of the number of neighbors N_i, rather than the (N_i + ν choose ν) scaling of the naive representation (49), which enables modeling very high body orders. A recursive evaluation of the ν-correlations implemented by the MTP and ACE bases, or by the NICE formalism, avoids an unfavorable scaling of the evaluation of the high-order terms (see Section VIII D for a summary of these techniques).
Density smearing. The real-space view of the density correlation features may be more intuitive when considering a finite smearing of the atomic contributions to |ρ_i⟩, which gives rise to a smooth function that can be seen as a proxy for the electronic density, and is reminiscent of the atoms-in-molecules 161 description of the electronic structure of a molecule or a condensed-phase system as a collection of atom-centered densities.
Since we derived the link between density correlations and body-ordered potentials, and in particular the proof of the completeness of the linear expansion, only in the limit of a sharp density, we now discuss whether a similar formal guarantee holds for a general |ρ_i⟩, admitting in particular a smearing of the atomic contributions. With tensor-product bases, all statements derived for higher correlation orders can eventually be reduced to a one-dimensional description, which is sufficient to reveal the essential features of the problem.
Given a function f(x) which can be expanded in the monomial basis p_n(x) = x^n, we want to check whether we can also represent f in terms of smeared polynomials,

p^σ_n(x) = ∫ g_σ(x − y) p_n(y) dy,    (65)

where g_σ is a normalized Gaussian of width σ. For this particular choice of Gaussian smearing we can evaluate the expression explicitly and obtain p^σ_n(x) = x^n + lower-order terms, i.e., p^σ_n is in fact still a polynomial with leading-order term x^n, and this means that the smeared monomials form a basis. In particular we can now again represent the truncated expansion f_{n_max}(x) exactly in terms of the p^σ_n, and in the limit n_max → ∞ we recover f. In the more general case, suppose that we have an arbitrary complete basis {φ_j}. Then we can approximate x^n ≈ Σ_j b_nj φ_j(x). The smearing operator g_σ ∗ · is bounded, which allows us to exchange smearing and expansion. Given that the p^σ_n are dense, it follows that the smeared basis functions φ^σ_j = g_σ ∗ φ_j are dense as well. From these arguments it is reasonable to conjecture that the smeared density correlations also form a complete linear basis. There is however an important caveat: the inverse of the smearing operator is unbounded, which implies that the coefficients of the expansion of f in terms of the smoothed polynomial basis necessarily blow up as the order is increased, even if f has a stable expansion in a polynomial basis. Therefore, in practice, the smoothing of the density, the truncation of the basis, and the regularisation of the regression must be carefully coordinated and adapted to the natural scale of the variations of the target function f, i.e. to its "natural" smoothness.
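The statement that the smeared monomial keeps its leading term can be checked explicitly for low orders. The sketch below computes the Gaussian-smeared monomial through the moments of the Gaussian, under the assumption that the smearing is a convolution with a normalized Gaussian of width σ (function names and the quadrature cross-check are ours):

```python
import numpy as np
from math import comb

sigma = 0.3

def gaussian_moment(k, sigma):
    # E[Z^k] for Z ~ N(0, sigma^2): zero for odd k, sigma^k (k-1)!! for even k
    if k % 2 == 1:
        return 0.0
    m = 1.0
    for i in range(1, k, 2):
        m *= i
    return sigma**k * m

def smeared_monomial_coeffs(n, sigma):
    # (g * p_n)(x) = E[(x - Z)^n], expanded in powers of x; coeffs[p] is the
    # coefficient of x^p in the smeared polynomial p_n^sigma
    coeffs = np.zeros(n + 1)
    for k in range(n + 1):
        coeffs[n - k] = comb(n, k) * (-1)**k * gaussian_moment(k, sigma)
    return coeffs

# leading coefficient is always 1: p_4^sigma(x) = x^4 + 6 sigma^2 x^2 + 3 sigma^4
c4 = smeared_monomial_coeffs(4, sigma)

# cross-check one point against brute-force numerical convolution
x0 = 0.7
z = np.linspace(-6*sigma, 6*sigma, 40001)
g = np.exp(-z**2 / (2*sigma**2)) / (sigma * np.sqrt(2*np.pi))
numeric = np.sum(g * (x0 - z)**4) * (z[1] - z[0])
analytic = np.polyval(c4[::-1], x0)
```

The higher-order corrections (6σ²x² + 3σ⁴ for n = 4) illustrate the caveat above: undoing the smearing would require subtracting terms that grow quickly with the order and with σ.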

B. Long-range features and potential tails
A similar formal correspondence with well-established functional forms of physical interactions can be derived when using (scalar) multi-scale LODE features (46) within an additive, linear learning model, using as target the electrostatic energy U(A). The fact that the representation is linear both in the density and in the potential fields allows one to derive rigorous asymptotic relationships for the interaction between two distant portions of the system, which resemble the electrostatic interactions between the multipoles of a localized charge density distribution and any other charge that is located arbitrarily far away. 137 Focusing only on the long-range contribution U^> to U(A_i), which is associated with the part of |A; V_i⟩ generated by the far-field density, |A; V^>_i⟩, one can write an expansion in which the reader can recognize the similarity with the multipole expansion of the electrostatic potential. 132 Here |ρ^>_i⟩ indicates the atom density outside the cutoff, which is not computed explicitly but is encoded in the expansion of the local atomic potential (44). The coefficients ⟨a_1 a_2; lm|M^<_i(U)⟩ can be written as a combination of the regression weights ⟨a_1 n_1; a_2 n_2; l|U⟩ and the local density coefficients ⟨anlm|ρ^<_i⟩, and can be interpreted as adaptive multipole coefficients that depend in a general manner on the atomic distribution within the environment.
Given that the atomic densities and potentials are not the physical charge density and electrostatic potential of the system, it is the role of the regression procedure to modulate the multipoles so as to reproduce the reference data for the electrostatic energy. In Fig. 11 we report an example where this is demonstrated by extrapolating the long-range interaction between a pair of rigid H2O and CO2 molecules, upon training the multi-scale LODE model on the long-range, yet not asymptotic, interaction profiles associated with 33 different reciprocal orientations of the two molecules. The figure compares the asymptotic extrapolation performance upon centering the representation on different atomic centers, as well as upon truncating the angular expansion at different l_max. It is apparent that the angular cutoff chosen reflects the number of multipoles introduced in the expansion of Eq. (70), and thus determines sharp crossovers of the prediction accuracy across critical l_max values. For instance, a model that uses only features centered on the oxygen atom of H2O dramatically improves its performance when l_max is increased from zero to one. A model using the carbon atom of CO2 as the only environment shows a similar, sharp improvement in accuracy when going from l_max = 1 to l_max = 2. This is consistent with the primarily dipolar nature of the electrostatic field generated by a water molecule, and with the quadrupolar nature of the center-symmetric carbon dioxide. Even though this example showcases the link between a linear model based on |ρ_i ⊗ V_i⟩ and multipole electrostatics, the representation is sufficiently flexible to describe also other kinds of interactions, as demonstrated in Figure 10.

C. Non-linear models
Historically, linear representations used basis sets in internal coordinates (typically interatomic distances or simple transformations of them) that exploded in size with body order, see e.g. Refs. 75,158, and whose computational cost of prediction scaled exponentially due to the need to sum over all ν-clusters in a configuration or atomic environment. Moreover, it was clear that high body orders would be needed to obtain the desired accuracy, especially for models of materials. About a decade ago, non-linear fits using low body-order (ν = 2) descriptors appeared, with the surprising result that a few hundred degrees of freedom were enough to obtain good potentials 10,12 . Contrary to linear modeling, where the symmetry-adapted features ⟨q|A_i⟩ are used as a basis, in the context of non-linear regression they are best thought of as a coordinate transformation φ(A_i). In a linear setting the choice of a basis, and the details of the implementation, are a matter of computational performance, but can be converged to a well-defined, basis-set-independent limit. When used as the input of a non-linear model, instead, the entries of the feature vector φ_q(A_i) = ⟨q|A_i⟩ must always be precisely defined, because there is no complete basis set limit in which the models become equivalent. If F(A_i) is a symmetric property such as a site energy, we aim to construct approximations of the general form F(A_i) ≈ F̃(φ(A_i)). The two most commonly used models for F̃ are artificial neural networks 19,20,88,126,139,156,162,163 (ANN) and kernel ridge regression 21,22,28,97,151,157,[164][165][166] (KRR) models. In KRR models, 167 one builds a kernel matrix K with elements K_ij = K(φ(A_i), φ(A_j)), which provide a similarity measure between the environments A_i and A_j, measured in terms of the similarity between the corresponding feature vectors φ(A_i) and φ(A_j). Useful kernel functions K are non-linear, e.g. polynomials, Gaussians, etc. 168 .
The kernel inherits the symmetry of the feature vectors, and therefore a model for a symmetric property F (A i ) can be obtained as where, in the simplest setting, the M j are scattered interpolation points, but more generally are simply a collection of "centers" which induce a basis {K(·, φ(M j ))} j in the symmetrized feature space. The weights c j are then obtained by a linear regression. Kernel models have two main advantages over "naive" linear regression using the same features.
(1) They introduce implicitly a non-linear mapping between the inputs and a "reproducing kernel Hilbert space", |A_i⟩ → |A_i; K⟩, which has a larger (often infinite) dimensionality, allowing for a more flexible approximation of F(A_i). (2) Given that the basis is centered on the training points, it is adapted to the geometry of the data set in feature space. For example, if the centers |A_i; K⟩ in feature space fall on (or close to) a low-dimensional manifold, then the KRR model naturally exploits this. For a comprehensive discussion of the use of kernel methods in atomistic modeling, see Ref. 169. In the context of the body-ordered features discussed above, the non-linearity in the kernel effectively increases the body order of the features used in the regression model, but in a rather special way: only those high body-order terms are present that can be obtained as functions of low body-order features. See Section VI on completeness for a more detailed discussion. While non-linear models are by their very nature more flexible in representing complex high-dimensional features, linear models come with different advantages. As we have shown in Section V A and Section V B, they tend to be more easily "interpretable", e.g. in terms of a body-ordered expansion of the target properties, or in terms of physically-motivated asymptotic forms of the interactions. But are non-linear models necessary to achieve high accuracy? This notion is challenged by the SOAP 27 , SNAP 121,122 , MTP 111 , ACE 106 and NICE 123 representations: the "density trick" and its generalisations to higher body orders, replacing polynomials with correlations of the atom-centered density, circumvent both the explicit symmetrization and the summation over all ν-clusters of traditional body-ordered expansions. Particularly when using density correlations above ν = 2, it is critical to fully exploit the computational cost gains offered by permutation symmetry.
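The KRR construction described above can be sketched in a few lines. The features, targets, kernel parameters and regularization below are toy illustrative assumptions, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(1)

# toy symmetric "features" of 40 training environments, and a smooth target
Phi_train = rng.uniform(-1, 1, (40, 3))
y_train = np.sin(Phi_train @ np.array([1.0, -2.0, 0.5]))

def gaussian_kernel(P1, P2, gamma=2.0):
    # K(phi, phi') = exp(-gamma |phi - phi'|^2): similarity of two environments
    d2 = np.sum((P1[:, None, :] - P2[None, :, :])**2, axis=-1)
    return np.exp(-gamma * d2)

# fit: solve (K + lambda I) c = y, using the training points as kernel centers M_j
K = gaussian_kernel(Phi_train, Phi_train)
c = np.linalg.solve(K + 1e-8 * np.eye(len(K)), y_train)

def predict(Phi_new):
    # F(A) = sum_j c_j K(phi(A), phi(M_j))
    return gaussian_kernel(Phi_new, Phi_train) @ c

y_fit = predict(Phi_train)   # reproduces the training targets closely
```

Because the kernel is evaluated on the symmetrized features, the prediction inherits their invariances; the regularization parameter λ (here 1e-8) controls the trade-off between interpolation accuracy and smoothness.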
Even if one were to initially specify a model in terms of the "natural" body-order expansion (49), one should convert it, for computationally efficient evaluation, to a representation in terms of ν-correlations, e.g. within the ACE 106,107 , MTP 111 or NICE 123 frameworks. By employing the recursive evaluations introduced in Refs. 107,111,123, this transformation makes it possible to truncate at very high body orders without a significant penalty in computational cost, as discussed in more detail in Section VIII D. As an illustration of how a linear fit based on high-quality density-correlation representations can compete with non-linear models, we show in Fig. 12 the learning curves resulting from the regression of the atomization energy for a very large and geometrically diverse database of CH4 configurations (generated by randomly displacing the H atoms around the central carbon, in a sphere with a radius of 3.5 Å). The plot reflects a tradeoff between model complexity and the availability of training data. Saturation of the learning curves indicates that the model does not have sufficient flexibility to fully describe the underlying structure-property relations. 28,103 Thus, linear models based on NICE features incorporating higher and higher body orders are capable of describing the structure-property relations to a higher degree of accuracy, which is apparent in the delayed saturation of the learning curve. One sees that a ν = 4 model starts saturating around n_train = 10^5, even though the system is composed of 5 atoms, so that the body-ordered expansion should be fully converged. This is because a linear model requires a complete basis, while here we select only a few thousand invariants at each body order. A NN model can be designed to be more flexible and beat this saturation, at the expense, however, of performance in the small-data limit, which, in a more chemically diverse regression exercise, usually translates into poorer transferability.

VI. GEOMETRIC COMPLETENESS OF REPRESENTATIONS
In Section V we discussed two classes of models built from invariant features: linear models, for which the representation plays the role of a basis to expand the target property, and non-linear models, where the representation φ(A_i) = {⟨q|A_i⟩}_q plays the role of a coordinate transformation generating a finite-dimensional feature vector used as the argument of a non-linear function. In order to guarantee systematic convergence of these models in suitable limits, we require that the employed set of features be complete. We already hinted in Section V C that these two scenarios lead to different requirements on the notion of completeness. In this section we provide a more in-depth discussion of the completeness issue in the non-linear setting, and point out open problems.
Suppose we are given a finite collection of symmetry-adapted features φ(A) which we wish to use as a descriptor for atomic structures or environments, for example symmetrized correlations of the density as described in the foregoing sections. We say that this descriptor is geometrically complete if the mapping A → φ(A) is injective. Colloquially, this means that any two atomic configurations that are not related by symmetry are mapped to different descriptors. The consequence of this definition is that, if a feature vector φ(A) is geometrically complete, then a symmetric property F(A) can be written as F(A) = F̃(φ(A)), where F̃ is now a generic non-linear function in feature space which can be constructed by any arbitrary regression scheme. We then say that the feature vector φ(A) is algebraically complete, which is a less stringent requirement than the linear completeness discussed in Section V A.

A. Algebraic completeness: an elementary example
The ideal goal would be to have complete finite feature sets, which allow one to approximate any symmetric function of the coordinates to arbitrary accuracy. As an elementary introduction to how such a construction might be achieved in principle, we consider a collection of N particles in 1D, {x_i}_{i=1}^N. As a concrete example, one can take two particles with positions (x_1, x_2). In the absence of an angular component, we only need to consider the projection of the density ρ(x) = Σ_i δ(x − x_i) onto the monomial basis x^n,

⟨n|ρ⟩ = Σ_i x_i^n.

For example, if N = 2, ⟨1|ρ⟩ = x_1 + x_2, ⟨2|ρ⟩ = x_1^2 + x_2^2, etc. In this simple setting, one sees easily how the ν-point density correlations

⟨n_1 . . . n_ν|ρ^⊗ν⟩ = Σ_{j_1 . . . j_ν} x_{j_1}^{n_1} · · · x_{j_ν}^{n_ν}    (76)

form a basis of symmetric polynomials, which is linearly complete because it contains all possible symmetrized monomials. In analogy to what we did in Section V A, we use the "self-interaction" formulation, in which the sum extends over all tuples of particle indices. For the case of two particles, linear combinations of ⟨n_1 n_2|ρ^⊗2⟩ are sufficient to write any symmetric polynomial of the particle positions.
Thus, if we allow for algebraic operations on the ⟨n|ρ⟩, it is clear that the ν = 1 coefficients provide a sufficient basis, because the elements of the linear basis (76) can be obtained as products, e.g. ⟨n_1 n_2|ρ^⊗2⟩ = ⟨n_1|ρ⟩⟨n_2|ρ⟩. In fact, well-established results from the theory of symmetric polynomials 170 allow making an even stronger statement. The first N power-sum polynomials (⟨n|ρ⟩)_{n=1}^N provide an algebraically-complete basis to write any symmetric polynomial function of the coordinates of N particles. For instance, for N = 2 we can express the n = 3 term as a polynomial of ⟨1|ρ⟩ and ⟨2|ρ⟩,

⟨3|ρ⟩ = (3/2) ⟨1|ρ⟩⟨2|ρ⟩ − (1/2) ⟨1|ρ⟩^3.

This result implies, in general, that the mapping is injective: knowledge of the first N features ⟨n|ρ⟩ allows us to uniquely reconstruct the configuration (but not the indices of the atoms). That is, this minimal feature set φ(A) is indeed geometrically complete. It is not too difficult to construct similar complete and finite feature sets for finitely many particles in two and three dimensions, as long as only permutational symmetry is considered. However, incorporating also rotational symmetry into the equivalence of particle configurations makes this much more challenging, as we discuss next.
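These statements are easy to verify numerically. The sketch below uses our reconstruction of the N = 2 relations: Newton's identities give ⟨3|ρ⟩ = (3/2)⟨1|ρ⟩⟨2|ρ⟩ − (1/2)⟨1|ρ⟩³, and the two positions can be recovered (up to ordering) from the first two power sums through the elementary symmetric polynomials:

```python
import numpy as np

x = np.array([0.8, -1.3])            # two particles in 1D

def power_sum(n, x):
    # <n|rho> = sum_i x_i^n : projection of the density on the monomial basis
    return np.sum(x**n)

p1, p2, p3 = (power_sum(n, x) for n in (1, 2, 3))

# for N = 2, <3|rho> is a polynomial in <1|rho> and <2|rho> (Newton's identities)
p3_reconstructed = 1.5*p1*p2 - 0.5*p1**3

# geometric completeness: recover {x1, x2} up to permutation from p1, p2 alone,
# via the elementary symmetric polynomials e1 = p1 and e2 = (p1^2 - p2)/2
e1, e2 = p1, 0.5*(p1**2 - p2)
roots = np.sort(np.roots([1.0, -e1, e2]))   # roots of t^2 - e1 t + e2 = 0
```

The reconstruction returns the positions as an unordered pair, which is exactly the statement that the descriptor identifies the configuration but not the particle indices.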

B. Geometric completeness of density correlations
In general, for three-dimensional atom configurations it is clear that taking all ν-correlations provides a complete set of features (after all, they are even complete in the linear sense), however, this is not useful in practice. As we explain next, it remains an open problem how to construct a minimal complete feature set in this general setting.
It is clear just based on dimensionality arguments that a descriptor that has fewer than 3N − 6 components (the number of elements in the Cartesian position vectors, subtracting the degrees of freedom associated to translations and rotations) cannot be complete for N particles. On the other hand, the descriptors based on ν-point correlations have a number of components that scales with N ν . But having more than the necessary minimum number of components does not ensure that a descriptor is complete.
Although it has long been appreciated that two-correlations for entire structures are not complete, i.e. knowing the set of distances between points is not enough to reconstruct the point set 27,102,171 , it was not until recently that the connection to environment descriptors was made 104 . The fact that there exist degenerate pairs of inequivalent environments mapping to the same descriptor for two-correlation (distance-angle) representations came as a surprise, because so many "successful" models for potential energy surfaces based on such descriptors have been published in the past decade 14,169 . An example of such a "degenerate pair" is given in Figure 13. The construction involves an environment with four neighbors on the unit circle, with the two structures corresponding to the labels (1, 2, 3, 4) and (1, 2, 3′, 4′) being different, but having the same unordered list of distances and angles. The total number of degrees of freedom for this layout is three (because one neighbor can be fixed on the x axis), and there is one degree of freedom in the construction of the degenerate pair (the angle labelled α in Fig. 13). Thus, this manifold of pairs of degenerate configurations has a codimension of two, i.e. it has a dimensionality that involves two fewer degrees of freedom than the total. A more general construction, which yields a family of 3D degenerate pairs including an arbitrary number of neighbors, is discussed in Ref. 104. The fact that the degenerate pairs form a manifold does not mean that there is a degenerate manifold, i.e. a manifold of configurations all mapping to the same descriptor. This type of degeneracy occurs between pairs of configurations which are typically far from one another, and so this degeneracy problem differs from that of assessing the sensitivity of a representation to small atomic displacements 172,173 . As was shown in Ref. 104, in order to break this degeneracy the correlation order has to be increased.
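For entire structures in 1D, the failure of the distance list to pin down the point set can be demonstrated with a classical pair of homometric sets (a textbook example from the point-set reconstruction literature, not the specific construction of Fig. 13):

```python
import numpy as np
from itertools import combinations

# classic homometric pair: different 1D point sets, identical distance multisets
A = np.array([0, 1, 4, 10, 12, 17])
B = np.array([0, 1, 8, 11, 13, 17])

def distance_multiset(points):
    # unordered list of all pairwise distances: the "2-correlation" of the structure
    return sorted(abs(p - q) for p, q in combinations(points, 2))

same_distances = distance_multiset(A) == distance_multiset(B)

# yet the structures are not related by translation or reflection
congruent = (set(A - A.min()) == set(B - B.min())
             or set(A - A.min()) == set(B.max() - B))
```

Both sets produce the distance multiset {1, 2, 3, ..., 13, 16, 17}, while no rigid motion maps one onto the other, so any model that sees only pairwise distances must assign them identical properties.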
Three-correlations (|ρ^⊗3_i⟩, equivalent to the unordered set of central tetrahedra, and to the bispectrum of the atomic density) indeed distinguish environments such as those in Fig. 13. It is however possible to build pairs of environments, composed of 7 or more neighbors, which are distinct but have the same three-correlations. This example, also discussed in Ref. 104, raises a number of open mathematical questions: (i) is the ν = 3 descriptor complete for N < 7 neighbors, (ii) are all ν-correlations degenerate for sufficiently many neighbors, (iii) what is the codimension of the manifold of degenerate configurations for ν > 2?
The concept of completeness applies both to representing entire structures and to atomic environments, but the relationship between these two cases is subtle. Given an entire structure, it can be considered to be the "environment" of the point at the origin, and the same symmetries apply. However, specific representations appear differently in the two views. For example, the two-correlation for an entire structure is the set of interparticle distances; the two-correlation of the neighbors in an environment, instead, is naturally thought of as a three-body correlation, i.e. a set of distances from the origin and central angles. Note that the problem of completeness for entire structures is exactly the same as the problem of reconstructing point sets 102 .
One way to break the degeneracy between the representations of two entire structures involves combining information on different environments. For instance, one can describe the entire structure using an additive combination of atom-centered features analogous to Eq. (15). Following the above reasoning, a pair of environments that are degenerate in terms of the list of distances and angles are also structures that are degenerate in terms of the list of distances. However, these structures are not necessarily degenerate in terms of the combined list of distance and angle histograms of each local environment. Thus, taking non-linear transformations of atom-centered features cannot resolve the environment-level degeneracies, but can provide a way to differentiate entire structures. 104 The construction of injective yet concise representations for environments and structures is still an open problem, whose solution may help to improve the accuracy and computational efficiency of machine-learning models.
Note that in this discussion we are implicitly taking atomic structures related by symmetry to be identical, and we focus on whether injectivity holds when the domain of the descriptor map is the set of actual atomic structures. Whether the same considerations hold for general scalar fields (e.g. those arising in the LODE construction) is a separate problem. For the case of translation symmetry (torus geometry) it is well known that no finite correlation order suffices to reconstruct all signals 174 ; however, most signals can be reconstructed already from the bispectrum (ν = 3). To the best of our knowledge it is an open problem whether analogous results hold for the case of rotational symmetry in 3D spherical geometry 175,176 .

C. Spectral representations
As we explained above, the set of all (N − 1)-correlations is complete for N particles; this is equivalent to the completeness of polynomial basis sets such as MTP 111 , PIP 158 , aPIP 78 , ACE 106,177 and NICE 123 (see also Section V A). Any of these bases can be expressed in terms of the ν-correlations via a linear transformation, and vice versa. Even for a fixed maximum polynomial degree, these are enormous representations. Depending on how the ν-correlation features are chosen, their number might scale as rapidly as (q_max + ν choose ν), where q_max is the number of one-particle features.
There is a class of much lower-dimensional descriptor maps based on the eigenspectra of overlap matrices 86,87 that lifts the degeneracy for the known examples, although their actual completeness is unknown. The construction of these "spectral representations" proceeds as follows. First, one constructs an artificial overlap matrix based on the positions of atoms within the i-centered environment $A_i$,

$$T_{jj'} = f_\mathrm{cut}(r_{ji})\, f_\mathrm{cut}(r_{j'i})\, t(r_{jj'}),$$

where $t : \mathbb{R} \to \mathbb{R}$. Then, one computes the ordered spectrum $(\tau_k)_{k=1}^N$ of $T$. If $T$ is invariant (or covariant), then $(\tau_k)_k$ is an invariant descriptor of $A_i$. Due to eigenvalue crossings, the mapping $A_i \mapsto (\tau_k)_k$ is non-smooth, hence one should project it on a smooth basis, e.g. polynomial,

$$\phi_k = \sum_{k'=1}^{N} (\tau_{k'})^k. \qquad (80)$$

The features (or fingerprints, as they are also called 87) $(\phi_k)_{k=1}^N$ correspond to the moments of the histogram of eigenvalues, and contain precisely the same information.
An alternative way to write $\phi_k$ is

$$\phi_k = \mathrm{Tr}\, T^k, \qquad (81)$$

which is not computationally more efficient, but highlights the close connection between $(\phi_k)_k$ and the body-ordered features we discussed in previous sections. From (81) we observe that

$$\phi_1 = \sum_j T_{jj}, \qquad \phi_2 = \sum_{j j'} T_{jj'} T_{j'j}, \qquad \phi_3 = \sum_{j j' j''} T_{jj'} T_{j'j''} T_{j''j},$$

and so forth. That is, the feature (fingerprint) $\phi_k$ contains the projection of the histogram of k-simplices onto a single basis function. In other words, the cutoff function $f_\mathrm{cut}$ and the overlap function $t$ play the role of $R_n$ and $P_l$ in (34). Thus, the $\phi_k$ provide invariant high body-order features at relatively low computational cost.
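As a concrete toy illustration, the sketch below builds an overlap matrix of the form $f_\mathrm{cut}(r_{ji}) f_\mathrm{cut}(r_{j'i}) t(r_{jj'})$, using an illustrative cosine cutoff and Gaussian overlap that are our own assumptions rather than the choices of Refs. 86,87, and verifies that the moments $\phi_k = \mathrm{Tr}\, T^k$ are rotationally invariant:

```python
import numpy as np

def overlap_matrix(env, r_cut=5.0):
    """Toy overlap matrix T for an i-centered environment.

    `env` holds neighbor positions relative to the central atom. The
    entries follow T_jj' = f_cut(r_ji) f_cut(r_j'i) t(r_jj'), with
    illustrative (assumed) choices for the cutoff and overlap function.
    """
    r = np.linalg.norm(env, axis=1)
    fc = 0.5 * (1 + np.cos(np.pi * np.clip(r / r_cut, 0.0, 1.0)))  # smooth cutoff
    d = np.linalg.norm(env[:, None, :] - env[None, :, :], axis=-1)
    t = np.exp(-0.5 * d**2)  # Gaussian overlap of neighbor-neighbor distances
    return fc[:, None] * t * fc[None, :]

def fingerprints(env, kmax=4):
    """phi_k = sum_j tau_j^k = Tr T^k: moments of the eigenvalue histogram."""
    tau = np.linalg.eigvalsh(overlap_matrix(env))
    return np.array([np.sum(tau**k) for k in range(1, kmax + 1)])

rng = np.random.default_rng(0)
env = rng.normal(size=(8, 3))
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))  # a random orthogonal matrix

# T depends only on interatomic distances, so the phi_k are invariant
# under any rotation (or reflection) of the environment.
print(np.allclose(fingerprints(env), fingerprints(env @ Q.T)))  # True
```

Because the eigenvalues are never ordered or matched explicitly, the moments sidestep the non-smoothness caused by eigenvalue crossings.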
If one takes t to be scalar (as we have done here) then there are at most N invariant features for N neighbors, but 3N − 6 independent coordinates, so that the spectral features (80) must be grossly undercomplete. This source of incompleteness is easily lifted by simply taking multiple overlap matrices with different t functions, or taking t to be matrix-valued, as done in Ref. 87. However, even with that modification in mind, it is not at all understood whether these features are complete, or can be made complete with limited modifications. For example, it can be shown 107,123 that most high body-order features are actually polynomials of low body-order features, which means that they do not contain genuine high-correlation information. This can be observed very easily with a seemingly trivial modification of the spectral representation construction. Consider N particles on the unit circle at positions $r_{ji}$, as in Fig. 13. In particular, we then have only N − 1 independent variables, which means that a scalar t is in principle sufficient to identify the configuration. However, choosing $T_{jj'} := \cos\theta_{ijj'}$, it is straightforward to see that the two overlap matrices T for the two configurations of Figure 13 have eigenvalues {0, 0, 1, 3}. That is, this particular choice of spectral descriptor is unable to distinguish them, nor any two configurations with different α.
Even for a general atomic environment, $T_{jj'} = r_{ji} r_{j'i} \cos\theta_{ijj'}$ is the Gram matrix of the environment, which has at most three non-zero eigenvalues, and hence the collection $(\phi_k)_{k=1}^N$ contains at most three independent features even though, formally, $\phi_k = \mathrm{Tr}\, T^k$ has body order k. For a configuration in which the neighbors lie on a sphere, this matrix can be written as an overlap matrix by choosing an appropriate, monotonically decreasing $t(r_{jj'})$, and in the general case with an appropriate (albeit contrived) choice of $f_\mathrm{cut}$ and t. The purpose of these examples is to highlight that, although spectral descriptors offer some attractive properties, such as computationally cheap high body order, understanding under which conditions they are complete is subtle and requires a much deeper investigation.
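The rank argument is easy to check numerically: for any set of neighbor vectors, the Gram matrix $T_{jj'} = \mathbf{r}_{ji} \cdot \mathbf{r}_{j'i}$ has at most three non-zero eigenvalues, however many neighbors there are. A minimal sketch with random positions:

```python
import numpy as np

rng = np.random.default_rng(1)
env = rng.normal(size=(10, 3))   # 10 neighbor vectors in 3D

# T_jj' = r_ji . r_j'i = r_ji r_j'i cos(theta_ijj') is the Gram matrix
T = env @ env.T
tau = np.linalg.eigvalsh(T)

# The rank of a Gram matrix of 3D vectors is at most 3, so only three
# of the moments phi_k = Tr T^k can be independent.
print(np.sum(np.abs(tau) > 1e-10))  # 3
```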

VII. REPRESENTATIONS, STRUCTURES, PROPERTIES AND INSIGHTS
A mathematical representation of the structure of an atomic configuration is not only useful as the starting point of supervised-learning algorithms aimed at predicting its energy and properties. It can also be used, in combination with unsupervised learning schemes, to compare structures in search for repeating atomic patterns 179–190, to obtain low-dimensional projections that help visualize complex datasets 1,4,6,191–195, and more generally to describe the lie of the land in (free) energy landscapes and to interpret structure-property relationships in complex systems 196–199. There is a long-standing tradition of developing domain-specific descriptors for use in the automatic analysis of structural data. For instance, simulations of polypeptides have been interpreted in terms of backbone dihedral angles 200, discrete secondary-structure categories 201,202, as well as sophisticated continuous fingerprints of secondary structure and backbone chirality 203,204. Simulations of clusters and condensed-phase systems have often used more general indicators, such as Steinhardt order parameters 205, cubic harmonics 206, radial distribution functions (either directly 207,208 or in the form of entropy-inspired fingerprints 209), and histograms of coordination numbers 4, which can be seen as precursors of the atom-density correlation representations that we discuss in Section IV D. More broadly, general-purpose descriptors that can be understood, more or less transparently, as special cases of the density-correlation features $|\rho_i^{\otimes\nu}\rangle$ have been developed and used in unsupervised-learning contexts as much as in the context of regression models. A few examples include the diffraction-based fingerprints of Ziletti et al. 210, the local order metric of Martelli et al. 211, the spectral representations of Sadeghi et al. 86, and the Minkowski structure metric of Mickel et al. 212 (which closely resembles and anticipates the construction of the moment tensor potentials).
Understanding the way a representation converts the Cartesian coordinates of atoms into features is necessary to make sense of any subsequent analysis, because any explicit or implicit assumption made in the structure-feature map will be reflected in the unsupervised analyses based on those features 213. An example is given in Figure 14, which shows the effect of using rotationally variant or invariant features (respectively, $\langle nlm|\rho_i\rangle$ and $\langle n_1 n_2 l|\rho_i^{\otimes 2}\rangle$) to analyze a simulation of undercooled iron 178. Atoms are colored according to a two-dimensional projection describing the associated environments, in this case obtained using a kernel principal component analysis 214 built on the feature vectors $\phi(A_i)$. Using orientation-dependent features makes it possible to distinguish more clearly the presence of multiple grains, and would be useful, for instance, to investigate the texture of the nanocrystalline sample, much as one would do with an electron backscattering diffraction analysis. Using invariant features highlights that all nanocrystals have the same structure, and makes it possible to recognize the disordered environments at the grain boundaries. Conversely, investigating the effect of different representation choices on the unsupervised analysis of a well-understood system can be useful to better appreciate the relation between structure and features.
In this Section we summarize recent developments, and identify clear insights, related to the use of representations to determine the similarity between structures, to perform clustering and dimensionality-reduction analyses, and to build models that go beyond the injective structure-property map that we have used thus far.

A. Features, distances, kernels
Before delving into the use of structural representations to visualize and classify atomic configurations, let us recall the link between feature vectors $\phi(A_i)$, which are associated with structures or environments, and distances or kernels, which express the relationship between two such entities. For example, given a feature vector, it is possible to define a distance using e.g. a Euclidean metric,

$$D(A_i, A_{i'})^2 = \left| \phi(A_i) - \phi(A_{i'}) \right|^2,$$

and to use as a kernel the scalar product $K(A_i, A_{i'}) = \phi(A_i) \cdot \phi(A_{i'})$, or a non-linear function thereof, e.g. an exponential of a squared distance,

$$K(A_i, A_{i'}) = \exp\left( -D(A_i, A_{i'})^2 / 2\sigma^2 \right).$$

The opposite is also true: for a given set of configurations M, and any (negative definite) distance or (positive definite) kernel 215, it is possible to construct a set of features that generate the kernel by taking their scalar product, a practical implementation of the concept of reproducing kernel Hilbert space that underlies kernel methods 214. One only needs to construct the kernel matrix $K_{ij} = K(M_i, M_j)$ and find its eigenvalues and eigenvectors, $K u^{(j)} = \lambda_j u^{(j)}$; the reproducing features are then $\phi_j(M_i) = \sqrt{\lambda_j}\, u^{(j)}_i$. It is easy to see that the scalar product between the reproducing features computed for two members of the reference dataset yields exactly the value of the kernel function between the two configurations.
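As a minimal numerical sketch (with random stand-in feature vectors), the reproducing features can be read off directly from the eigendecomposition of the kernel matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))          # stand-in feature vectors phi(A_i)

# Any positive-definite kernel works; here a Gaussian of the squared
# Euclidean distance between the feature vectors.
d2 = ((X[:, None, :] - X[None, :, :])**2).sum(-1)
K = np.exp(-d2 / 2.0)

# Eigendecomposition K u^(j) = lambda_j u^(j) yields reproducing
# features Phi_ij = sqrt(lambda_j) u^(j)_i for each dataset member.
lam, U = np.linalg.eigh(K)
Phi = U * np.sqrt(np.clip(lam, 0.0, None))

# Scalar products between reproducing features recover the kernel exactly.
print(np.allclose(Phi @ Phi.T, K))  # True
```

For the reference set, any kernel method on K can therefore be viewed equivalently as a linear method on these reproducing features.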

B. Measuring structural similarity
Most unsupervised learning algorithms rely on the definition of a metric to tell structures apart depending on their similarity. At the most basic level, a metric that is capable of identifying identical structures is extremely useful in all the applications that aim at automating the search for materials or molecules with desirable properties 216–220. This is not an entirely trivial task: in molecular searches, a mismatch in the simple ordering of atomic indices can lead to the failure of metrics based on the alignment of conformers, such as the root mean square distance (RMSD), and the exact calculation of a permutation-invariant version would involve combinatorially increasing computational effort 86. In the case of condensed phases, one needs to deal with the problem that the same periodic structure can be described by different choices of unit cell size and orientation. The requirements for a metric to compare atomic structures are similar to those discussed in Section III, and have been discussed in great detail in Ref. 86: a good metric needs to be invariant to rotations, translations, and permutations 221, and still be capable of telling distinct structures apart 87. The resolving power of different metrics has often been compared using distance-distance correlation maps 86,101,104,173, such as those shown in Figure 15, which compare the distance between pairs of structures in a reference dataset as computed by two metrics. In the most extreme case, one observes pairs of structures that are identical based on one metric and distinct based on another, indicating the presence of a manifold of degenerate structures that are distinct, but cannot be told apart by one of the distances 104.
An important aspect when defining a metric for structural comparison is that one is often interested in measuring the dissimilarity between entire structures, D(A, A′). Most of the representations we have discussed thus far are designed to compare atom-centered environments, and therefore yield $D(A_i, A'_{i'})$. As a practical example, we define D as the Euclidean distance between the representations,

$$D(A_i, A'_{i'})^2 = \langle A_i | A_i \rangle + \langle A'_{i'} | A'_{i'} \rangle - 2 \langle A_i | A'_{i'} \rangle, \qquad (84)$$

where we use the abstract notation $|A_i\rangle$ rather than $\phi(A_i)$ to highlight the connection with the definition of the global representation $|A; \rho^{\otimes 2}\rangle$ as the sum of the environmental $|A_i; \rho^{\otimes 2}\rangle$, introduced in Section IV C. Different ways of combining atom-centered representations to obtain a structure-level comparison are discussed and benchmarked in Ref. 101, using a construction based on the definition of global kernels. Here we present the same strategies, but express them directly in terms of distances. The two formulations are equivalent when using the kernel-induced distance.
The simplest global distance can be defined as a mean over all environment pairs of the corresponding kernel, which induces

$$\bar{D}^2(A, A') = \left| \frac{1}{N_A} \sum_i \phi(A_i) - \frac{1}{N_{A'}} \sum_{i'} \phi(A'_{i'}) \right|^2, \qquad (85)$$

showing that the average environment distance $\bar{D}^2(A, A')$ can be computed by taking the Euclidean distance between the means of the environment features in the two structures. This construction is very natural, and consistent with an additive decomposition of properties in a regression model, but potentially lacks resolving power: two structures with very different environments could end up having a similar value of the average feature vector. An alternative way to determine a global metric involves finding the best match between the environments of the two structures, defining

$$\hat{D}^2(A, A') = \min_{P \in \mathcal{U}_{N_A \times N_{A'}}} \sum_{i i'} P_{i i'} D^2(A_i, A'_{i'}), \qquad (86)$$

where $\mathcal{U}_{N_A \times N_{A'}}$ is the set of $N_A \times N_{A'}$ doubly-stochastic matrices, i.e. matrices with positive entries such that the sums over rows and columns equal $1/N_A$ and $1/N_{A'}$, respectively. When $N_A = N_{A'}$, the optimal P contains only zeros and $1/N_A$, and the problem can be construed as a linear assignment problem, solvable in $O(N_A^3)$ time using the Hungarian algorithm 225. Much as in the case of sorted interatomic distances used as a structural representation (Section III B), the process of matching entries in the environment distance matrix introduces discontinuities in the derivatives of the distance metric. One can solve this problem, obtaining at the same time a scheme whose cost scales as $O(N_A^2)$ and which can be applied to the comparison of structures of different sizes, by introducing in Eq. (86) an entropy regularization controlled by the magnitude of a parameter γ. This approach was introduced in Ref. 226 for the general problem of solving optimal transport problems and of evaluating the Wasserstein distance between probability distributions, and was first applied to atomistic problems in Ref. 101, in terms of regularized entropy match (REMatch) kernels.
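The interpolation between the best-match and average limits can be sketched with a basic Sinkhorn iteration over random stand-in environment features; the normalization and convergence details of the actual REMatch construction 101,226 differ, so this is only an illustration of the idea:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def sinkhorn_match(D2, gamma, n_iter=500):
    """Entropy-regularized match between two sets of environments.

    D2[i, i'] holds squared distances between environment features.
    Returns sum_ii' P_ii' D2_ii', with P doubly stochastic (rows sum
    to 1/N, columns to 1/N'): a sketch of the REMatch idea.
    """
    n, m = D2.shape
    K = np.exp(-D2 / gamma)
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iter):  # alternate rescaling of rows and columns
        u = (1.0 / n) / (K @ v)
        v = (1.0 / m) / (K.T @ u)
    P = u[:, None] * K * v[None, :]
    return np.sum(P * D2)

rng = np.random.default_rng(0)
A, B = rng.normal(size=(6, 4)), rng.normal(size=(6, 4))
D2 = ((A[:, None, :] - B[None, :, :])**2).sum(-1)

best = D2[linear_sum_assignment(D2)].mean()   # gamma -> 0 limit
avg = D2.mean()                               # gamma -> infinity limit
print(best <= sinkhorn_match(D2, 1.0) <= sinkhorn_match(D2, 100.0) <= avg)
```

As γ grows the coupling P smears toward the uniform matrix and the regularized distance approaches the plain average; as γ shrinks it concentrates on the optimal assignment.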
By introducing a non-additive combination of the environments, REMatch kernels offer an increased resolving power compared to the plain average distance (85), as demonstrated in Figure 16. The figure also shows that Eq. (87) interpolates between the average and the best-match metrics, to which it tends for γ → ∞ and γ → 0, respectively.

C. Representations for unsupervised learning
As stressed in the introduction of this Section, in performing cluster analysis or dimensionality reduction the choice of featurization is not a neutral one, but introduces a bias that will be visible in the end result of the analysis 213. While this bias is sometimes desirable, as in Fig. 14, where a judicious choice of features makes it possible to emphasize, or ignore, the orientation of grains in a polycrystalline sample, one should resist the temptation to fine-tune parameters that do not have an obvious meaning in order to obtain a result that reflects a preconceived interpretation of the data. The top row of Fig. 17 shows how different choices of the hyperparameters of the SOAP power spectrum (cutoff radius $r_\mathrm{cut}$, density smearing $\sigma_g$, and the types of atoms that are used as environment centers) unpredictably change the distribution of the points on the 2D map obtained by principal component analysis of the features. In the first panel, in particular, one can recognize a degree of correlation between the position of the points and intuitive structural and energetic properties, such as the number of H-bonds and the lattice energy. The correlation is, however, far from perfect, and with other reasonable choices of hyperparameters it disappears almost completely.
One possible approach to make unsupervised models less dependent on the details of the underlying featurization is to combine them with an element of supervised learning. This includes, for instance, combining or contrasting density-based clustering with (kernel) support vector machine classification 227. Even more explicitly, one can combine a variance-maximization scheme analogous to PCA with the regression of a target property, as in principal covariates regression (PCovR) 228. In PCovR one minimizes a loss built as a mixture of a PCA and a linear-regression loss, weighted by a mixing parameter α,

$$\ell = \alpha \left\| \Phi - \Phi P_{\Phi T} P_{T \Phi} \right\|^2 + (1 - \alpha) \left\| Y - \Phi P_{\Phi T} P_{T Y} \right\|^2.$$

The matrix $P_{\Phi T}$ projects from the feature space to a low-dimensional latent space, $P_{T \Phi}$ reconstructs an approximation of the full-dimensional feature vector based on its latent-space embedding, and $P_{T Y}$ regresses the property matrix Y using the latent-space coordinates as inputs. By explicitly looking for a latent-space projection from which a target property can be regressed linearly, one forces the dimensionality reduction to identify a subspace of the chosen features that correlates well with one or more quantities of interest. The lower row of Fig. 17 is obtained using a recent kernel extension of this method (KPCovR 224), attempting simultaneously to maximise the spread of the data and the kernel regression of the lattice energy, giving equal weight to the two components (α = 0.5). Not only do the points on the resulting map correlate very well with the target: structural parameters such as the H-bond counts are now also clearly separated between different regions, and further groups of well-clustered structures appear that correspond to similar isomers of azaphenacene 224. What is perhaps more important, introducing an explicit supervised-learning target leads to maps that are more consistent across different choices of hyperparameters.
Thus, (K)PCovR reduces the arbitrariness of the description, and mitigates the risk of implicitly introducing an unknown bias.
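A minimal sketch of PCovR, following the common sample-space formulation in which one diagonalizes a Gram matrix that mixes the PCA and regression terms; the exact normalization conventions vary between implementations, so the details below are assumptions:

```python
import numpy as np

def pcovr_latent(Phi, Y, alpha=0.5, n_components=2):
    """Sketch of principal covariates regression (PCovR).

    Mixes a PCA term (weight alpha) with a linear-regression term
    (weight 1 - alpha) in a modified Gram matrix, then uses its top
    eigenvectors as the latent-space projection T. Phi and Y are
    assumed centered/standardized.
    """
    # Least-squares approximation of Y from Phi (small ridge for stability)
    W = np.linalg.solve(Phi.T @ Phi + 1e-8 * np.eye(Phi.shape[1]),
                        Phi.T @ Y)
    Yhat = Phi @ W
    G = alpha * Phi @ Phi.T + (1 - alpha) * Yhat @ Yhat.T
    lam, U = np.linalg.eigh(G)
    idx = np.argsort(lam)[::-1][:n_components]
    return U[:, idx] * np.sqrt(lam[idx])

rng = np.random.default_rng(0)
Phi = rng.normal(size=(50, 10))
Phi -= Phi.mean(axis=0)
Y = Phi @ rng.normal(size=(10, 1))       # a target that is linear in Phi
T = pcovr_latent(Phi, Y, alpha=0.5)
print(T.shape)  # (50, 2)
```

Setting α = 1 recovers ordinary PCA scores, while α → 0 biases the latent space entirely toward directions that predict Y.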

D. Analyzing representations and datasets
The unsupervised or semi-supervised analysis of a dataset helps build an intuitive understanding of complicated structure-property relations for a material or a class of materials. Given the "black box" nature of many machine-learning models (and the fact that even the rigorously-defined density-correlation features we focus on in this review are high-dimensional and bear a non-trivial relationship to the actual atomic structure), low-dimensional projections of the feature space can also be useful to gain a better understanding of the structure of feature space itself. For example, Fig. 18a tells us less about the QM7 dataset 222 (which contains small organic molecules composed of C, H, N, O, S, Cl) than about the SOAP features that underlie the representation: the unsupervised analysis shows that the chemical composition is the most clear-cut differentiating characteristic when looking at this dataset through SOAP lenses. Fig. 18b visualizes the same QM7 data using a different representation, based on the Coulomb matrix, and shows how successive layers of a neural network transform these features into non-linear combinations that correlate very well with the target properties. Thus, this visualization helps one understand how a highly non-linear function transforms a description of the system into combinations that can be more easily used for regression, and helps diagnose the inner workings of the deep neural network.
A final "introspective" application of this kind of analysis involves examining the structure of a dataset, not as a way to learn about the atomistic configurations it contains, but about its makeup, or its relationship with other datasets. An example is given in Fig. 19, showing a comparison of the chemical space covered by three databases of organic molecules, with QM9 and AA being mostly disjoint, and the more diverse OE molecules encompassing both of the other sets. Other examples of this kind of analysis are discussed in Section IX.

E. Indirect structure-property relationships
[FIG. 20. (a) Prediction of the properties of the minimum-energy configuration of a structure; the problem can be made well-posed by using a cheap approximate method to optimize the structure, and taking the representation of this approximate structure as the input to regress accurate energy and geometry. (b) Prediction of a property that is associated with a thermodynamic average; the minimum-energy structure can be taken as a proxy for the ensemble, but a more formally precise "ensemble representation" is also possible.]

The one-to-one mapping between an atomic structure and its representation is one of the key requirements to achieve accurate "surrogate quantum models" of atomic-scale properties. However, it can also be a limitation whenever one wants to describe properties
that are not strictly associated with the specific configuration at hand. Consider, for example, the databases of molecular properties (e.g. the QM9 dataset 230) that have been extensively used as benchmarks, and have been a powerful driving force behind the development of the representations we describe here. The typical benchmark involves taking a structure whose geometry has been optimized at the DFT level and using it to predict the DFT energy, an exercise that is manifestly of little practical utility. What one would like to do, instead, is to avoid having to specify the optimized geometry, and to use a non-optimized structure to predict the properties of the nearest local configurational optimum. As shown in Fig. 20a, this is conceptually problematic, because we are now trying to achieve a many-to-one mapping. A possible solution is to map each distorted geometry to an idealized one, or to use a lower level of theory to determine a unique structure $\tilde{A}_0$. The many-to-one mapping is thus realized by the local optimization procedure, and the corresponding representation $|\tilde{A}_0\rangle$ can be used to uniquely identify the entire basin of attraction of the local minimum. For the training structures only, this geometry is optimized further at a higher level of theory, obtaining the structure $A_0$ for which properties are meant to be computed. When the model is fitted, the relationship between $\tilde{A}_0$ and its high-quality counterpart is learned implicitly. This kind of "indirect" model has been used, for instance, in Ref. 28, where structures optimized at the semiempirical PM7 233 level were used to predict CCSD energetics computed for a DFT-optimized version of the same compound. While the error was almost twice as large as that of a model using $|A_0\rangle$ directly as input, chemical accuracy could be reached by discarding from the training set those structures for which the DFT-optimized geometry was too different from the PM7-optimized one.
A similar conceptual problem arises when one wants to build models for properties that are associated with a thermodynamic state rather than with a precise structure, such as the melting point or the solubility of a material. Typically, and particularly if the ensemble consists of relatively small fluctuations around equilibrium, one might take a representative structure (e.g. the minimum-energy configuration) and use the corresponding representation $|A_0\rangle$ as a proxy of the thermodynamic state (Fig. 20b). Alternatively, consider the case in which the target property can be estimated as an ensemble average, e.g. over a Boltzmann distribution at inverse temperature β, $P(A) = e^{-\beta E(A)}/Z$, where $Z = \int \mathrm{d}A\, e^{-\beta E(A)}$ is the canonical partition function. One can exploit the linear nature of the representation and define an "ensemble ket"

$$|P\rangle = \int \mathrm{d}A\, P(A)\, |A\rangle.$$

With this definition, one could use a linear model for F(A) with weights $\langle q|F\rangle$ and see that

$$\langle F \rangle_P = \int \mathrm{d}A\, P(A)\, F(A) = \sum_q \langle F|q\rangle \langle q|P\rangle,$$

which is convenient because it allows using properties of configurations and of ensembles on the same footing, and possibly combining them in a single training exercise. Note also that the same approach could be applied in a kernel setting, by computing the ensemble average of the reproducing kernel Hilbert space vector associated with the structures.
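The interchangeability of averaging and linear prediction can be checked numerically with random stand-ins for the features, energies, and model weights:

```python
import numpy as np

rng = np.random.default_rng(0)
n_configs, n_feat = 100, 8
Phi = rng.normal(size=(n_configs, n_feat))   # features of sampled configs
E = rng.normal(size=n_configs)               # their (synthetic) energies
w = rng.normal(size=n_feat)                  # weights of a linear model

beta = 1.0
P = np.exp(-beta * E)
P /= P.sum()                                 # normalized Boltzmann weights

# "Ensemble ket": the probability-weighted average of the feature vectors
phi_ens = P @ Phi

# Linearity means the prediction for the ensemble-averaged features
# equals the ensemble average of the per-configuration predictions.
print(np.allclose(phi_ens @ w, P @ (Phi @ w)))  # True
```

In a realistic setting the sum over sampled configurations plays the role of the integral over the Boltzmann distribution, with the same linear-algebra identity at work.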

VIII. EFFICIENCY AND EFFECTIVENESS
We have discussed in Section III how most of the existing choices of representations share profound similarities, and shown, in Section IV, that many alternative schemes can be formally related to each other by means of a linear transformation, a smoothing, or a limit operation. This is not to say, however, that they are entirely equivalent in practical applications. The computational cost of evaluating them, and their performance in classifying structures and in regressing their properties, is determined by the choice of basis functions. Even for formally equivalent representations, the condition number of the linear transformation between them and their corresponding bases has a significant impact on the numerical behavior of the computed coefficients and of the quantities derived from them.

A. Comparison of features
A preliminary question when comparing alternative choices of features for the description of atomic structures and/or environments is that of establishing an objective way of assessing their relative merits. The performance when used in the regression of useful atomic-scale properties is an obvious criterion, but such a comparison is intimately intertwined with the target property and the regression algorithm 77,234,235. Very recent efforts have attempted to characterize different representations in terms of their information content, for instance through the eigenvalue spectrum of the covariance or kernel matrix associated with a dataset, the decrease in accuracy when reducing the number of features 172, or the sensitivity of the features to atomic displacements. This latter approach can be realized by directly comparing the separation in feature space against finite displacements of the atoms 172, or through an analysis of the Jacobian $J_{jk} = \partial \langle k|A_i \rangle / \partial r_j$ 173. The sensitivity of the features to small changes of the atomic positions indicates their usability and performance in regression or classification tasks. Onat et al. 172 analysed the effect of random perturbations in crystalline environments, finding that, for features based on atomic density correlations, displacements of atoms in the environment usually cause a linear response. One notable deviation from this trend is given by perturbations along some high-symmetry directions in atomic environments carved from perfect crystals, where the response to displacements is second-order, implying that the representations cannot capture these types of deformations (Fig. 21). However, as discussed in Ref. 172, the types of symmetric deformations applied in the study correspond to reflection operations. Due to the body-correlation order considered, the features are invariant to mirror symmetry, and so the observed loss of sensitivity is not unexpected.
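The loss of first-order sensitivity along such symmetric perturbations can be reproduced with a toy mirror-invariant descriptor: for an environment symmetric under $z \to -z$, the finite-difference Jacobian yields a vanishing response along a displacement pattern that is odd under the same mirror. The descriptor and geometry below are illustrative choices of ours, not those of Ref. 172:

```python
import numpy as np

def features(env, kmax=4):
    """Toy invariant features: moments of the neighbor-distance histogram."""
    r = np.linalg.norm(env, axis=1)
    return np.array([np.sum(r**k) for k in range(1, kmax + 1)])

def jacobian(env, eps=1e-6):
    """Central finite-difference Jacobian d(features)/d(positions)."""
    J = np.zeros((4, env.size))
    for a in range(env.size):
        d = np.zeros(env.size)
        d[a] = eps
        d = d.reshape(env.shape)
        J[:, a] = (features(env + d) - features(env - d)) / (2 * eps)
    return J

# An environment invariant under the mirror z -> -z (two mirror pairs)
env = np.array([[1.0, 0.2, 0.5], [1.0, 0.2, -0.5],
                [-0.3, 1.1, 0.8], [-0.3, 1.1, -0.8]])

# A displacement pattern that is odd under the same mirror: the linear
# response of any mirror-invariant feature vanishes along it.
d_odd = np.array([[0, 0, 1.0], [0, 0, 1.0], [0, 0, -1.0], [0, 0, -1.0]])

J = jacobian(env)
resp_odd = np.abs(J @ d_odd.ravel()).max()
resp_generic = np.abs(J @ np.ones(env.size)).max()
print(resp_odd < 1e-5 * resp_generic)  # True
```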
The response of the features to perturbations can also be analyzed in terms of the Jacobian, as discussed in Ref. 173. Another way to compare different bases is to analyse the landscape defined by the similarity or distance between environments, $D(A_i, A'_{i'})$, where the environment $A_i$ is kept fixed. The distance between the atom-centered environments $A_i$ and $A'_{i'}$ can be defined as the Euclidean distance between feature vectors, Eq. (84). It is clear that the field defined over the Cartesian coordinates of $A'_{i'}$ will have a global-minimum manifold where the field is exactly zero, corresponding to equivalent environments $A_i$ and $A'_{i'}$ that are related by symmetry operations. Whether there are other manifolds at exactly $D(A_i, A'_{i'}) = 0$, corresponding to the same features resulting from symmetrically nonequivalent environments, is related to the question of completeness (Section VI B). In practical applications, the shape of the global-minimum manifold also has implications for the numerical evaluation. In particular, one could examine how different $A_i$ and $A'_{i'}$ may be while $D(A_i, A'_{i'}) < \epsilon$, where $\epsilon$ is a small number. Using a random-search approach, the numerical sensitivity of the feature landscape has been analysed in Ref. 27: reference structures $A_i$ were perturbed and then reconstructed by minimising the distance $D(A_i, A'_{i'})$, and the optimised structures were compared to the reference ones. For small numbers of neighbors in the reference environment, all the examined representations performed similarly well, but only SOAP was capable of accurately reconstructing reference environments with more than 12 neighbors.
As we have seen in earlier sections, these differences can be attributed to the choice of basis functions used by the various representations. It should be noted, however, that SOAP distances and similarities converge in the limit of a complete basis: the actual form of the basis might therefore affect the convergence and the computational cost of the representation, but does not impact its resolving power.
A more explicit comparison between pairs of representations can be obtained by evaluating the error one incurs when using one set of features, arranged in a feature matrix Φ in which each row corresponds to a sample in a reference dataset, to linearly reconstruct a second featurization Φ′ of the same structures or environments, defining a global feature-space reconstruction error

$$\mathrm{GFRE}(\Phi, \Phi') = \left\| \Phi' - \Phi P \right\|, \qquad (92)$$

where P is a linear-regression weight matrix obtained on a training subset of the rows of Φ and Φ′, the error being evaluated on the remaining rows; both sets of features are assumed to be standardised. 236 The GFRE can be extended to also incorporate nonlinearity in the mapping, either through a locally-linear approach or by using a kernelized version, and is not symmetric. GFRE(Φ, Φ′) ≪ GFRE(Φ′, Φ) indicates that the featurization underlying Φ is more informative than that used to build Φ′, and vice versa. GFRE(Φ, Φ′) ≈ GFRE(Φ′, Φ) ≈ 0 implies that the two featurizations contain similar information, but they are not necessarily equivalent. One featurization could emphasize some structural correlations more than others: imagine, for instance, multiplying the entries of one column by a large constant. This kind of distortion, which can have a substantial impact on the performance of models built on Φ or Φ′, can be measured by defining a global feature-space distortion

$$\mathrm{GFRD}(\Phi, \Phi') = \left\| \Phi Q - \Phi P \right\|, \qquad (93)$$

where P is the same projection matrix that enters the definition of the GFRE (so that ΦP ≈ Φ′), and Q is the unitary transformation that best aligns Φ with the best linear approximation of Φ′.
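A minimal sketch of the (non-symmetric) GFRE on synthetic features, with an assumed normalization by the test-set feature norm; the published definition 236 may normalize differently:

```python
import numpy as np

def gfre(F1, F2, train, test, reg=1e-8):
    """Sketch of the global feature-space reconstruction error.

    Fits a linear map P on the training rows so that F1 @ P ~ F2, then
    reports the relative reconstruction error on the test rows. Both
    feature matrices are assumed standardized.
    """
    P = np.linalg.solve(F1[train].T @ F1[train] + reg * np.eye(F1.shape[1]),
                        F1[train].T @ F2[train])
    return (np.linalg.norm(F1[test] @ P - F2[test])
            / np.linalg.norm(F2[test]))

rng = np.random.default_rng(0)
F1 = rng.normal(size=(100, 6))               # "informative" features
F2 = F1[:, :3] @ rng.normal(size=(3, 3))     # depends on only part of F1
train, test = np.arange(60), np.arange(60, 100)

# F1 reconstructs F2 almost perfectly, but not the other way around,
# so the measure is strongly asymmetric.
print(gfre(F1, F2, train, test) < 0.05 < gfre(F2, F1, train, test))  # True
```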
If both the GFRE and the GFRD are zero, then the linearly independent components of Φ and Φ′ are related by a unitary transformation, which implies that distances and scalar products between feature vectors are equal between the two representations being compared. Figure 22 demonstrates the use of these measures to compare $|\rho_i^{\otimes\nu}\rangle$ features of different body order. The asymmetry is very clear, with higher-order features containing more information than their lower-order counterparts. Note that, in view of the linear nature of the mapping, this is not entirely obvious: formally, ν = 1 features are not linearly dependent on higher-ν features, and so these observations reflect the specific nature of the atom-density field whose correlations are being represented, and the nature of the structures in the benchmark datasets. The figure also includes invariants built with the N-body iterative contraction of equivariants (NICE) framework, which are designed to capture most of the information up to high body orders. The truncation of the expansion, which is necessary to keep the evaluation of ν = 4 features affordable, leads to a small residual GFRE when reconstructing ν = 3 features. The GFRD is rather large between all featurizations, reflecting the fact that, even though higher-order features contain sufficient information to describe lower-order correlations, they weight the information differently, which is why it is often beneficial to treat different orders of correlation separately in the construction of interatomic potentials. 15,152,155,165

B. Feature selection

Numerical feature vectors $\phi(A_i)$ are the result of a basis-set expansion of atomic environments which is, for practical purposes, truncated. A concrete discretization of the symmetrized ν-correlations is obtained by choosing a finite subset $q \subset q_\mathrm{total}$ from the set of all possible features. (A choice of $(n_\alpha, l_\alpha)_\alpha$ naturally induces a choice of the $m_\alpha$ and of the symmetrized features.)
The role of the discretization q is very different for linear and nonlinear models, and therefore warrants a brief comment. For nonlinear models we typically only require geometric completeness (see Section VI), which means that the feature set can be chosen to be minimal, but in such a way that all possible configurations, or at least all configurations of interest (e.g. from a training set), can be distinguished in a stable and smooth manner. While it is an open problem to characterize precisely what this entails, we generally expect that relatively small feature sets, on the order of hundreds for single-species scenarios, could be sufficient. On the other hand, converging a linear model requires eventually letting the discretisation q converge to the full feature set $q_\mathrm{total}$, which in practice leads to a much larger set q, and in particular to higher correlation orders ν, to achieve a desired accuracy, e.g. on the order of O(10 000) features for single-species models. The additional cost of training and evaluating the features is of course offset by the fact that there is no additional cost for evaluating a nonlinear model. Because of the large feature sets, the selection of an effective subset of q may be even more important in the linear setting. In particular, it is crucial to choose a priori sparse subsets of $q_\mathrm{total}$ rather than tensor-product sets, due to the combinatorial explosion of the number of features at high ν (curse of dimensionality). For example, a total-degree D discretisation was used by Bachmayr et al. 107, while closely related a priori sparsifications were used by Braams and Bowman 11 and by Shapeev 111, in all cases demonstrating accuracy and performance competitive with, or outperforming, nonlinear models.

Data-driven selections. When using high body-order features, some of the components can be related by non-trivial linear dependencies, which can be enumerated numerically 107,123.
The construction does not ensure, however, that there are no other linear dependencies specific to a given dataset, meaning that feature vectors could potentially be compressed even further without noticeable deterioration in the quality of the representation. The benefits of the compression are clear: if only a few components need to be evaluated, significant efficiency gains may be realised both in computational effort and in storage requirements.
Thus, the objective of feature selection or truncation is to find a subset of features that retains the information content of the original, untruncated representation. This is to be contrasted with dimensionality reduction techniques, which apply a linear transformation to the full feature vector to generate a lower-dimensional representation. These only reduce the computational cost of operations applied to the reduced feature vectors: the whole feature vector must still be evaluated before its projections can be determined.
An example of a feature selection strategy is the farthest point sampling (FPS) technique 240. One chooses an initial column ϕ_{c_0} (indexed by c_0) of the feature matrix, and then iteratively selects the columns that maximize the Hausdorff distance to the previously selected columns, effectively identifying the indices c of the features that have the most diverse values across the data set. FPS has also been used in a similar manner, but on the rows of Φ, in order to select a representative set of data points. 28,101,241 The CUR matrix decomposition 242, instead, generates a low-rank approximation of the feature matrix Φ, of the form Φ ≈ CUR. Unlike the singular value decomposition, CUR uses actual columns (C) and rows (R) of Φ. To make the selection, a score is associated with each feature c, based on the first k right singular vectors v_i of the singular value decomposition of Φ, where k is the approximate rank of Φ. Features may then be selected in a probabilistic procedure, or simply based on their score. Imbalzano et al. 243 argued that the scores associated with linearly dependent feature vector components are close to each other, so that the selection can easily result in a redundant set. Instead, in Ref. 243 a greedy algorithm based on the CUR decomposition was suggested, in which features are selected iteratively: the feature with the highest score is selected, and the columns of Φ are orthogonalized relative to the column corresponding to the selected feature. The scores are updated at each step, so that the linear dependence on already selected features is removed. This iterative scheme often performs better when using a very small value of k in constructing the leverage scores (98). Fig. 23 shows that a data-driven selection of the most relevant/diverse features makes it possible to achieve models with an accuracy that approaches that of the full model, while reducing the number of components by a factor of about 3 (for linear regression) or 10 (for KRR).
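The two selection strategies can be sketched in a few lines of numpy. These are simplified illustrations of the general idea, not the reference implementations of Refs. 240 and 243; in particular, the greedy CUR variant here uses a hard leverage-score maximum and an explicit orthogonalization step.

```python
import numpy as np

def fps_columns(X, n_select, first=0):
    """Farthest point sampling on the columns of a feature matrix X."""
    selected = [first]
    # distance of every column to the closest selected column
    dist = np.linalg.norm(X - X[:, [first]], axis=0)
    while len(selected) < n_select:
        c = int(np.argmax(dist))
        selected.append(c)
        dist = np.minimum(dist, np.linalg.norm(X - X[:, [c]], axis=0))
    return selected

def greedy_cur_columns(X, n_select, k=1):
    """Greedy CUR-style selection: pick the column with the largest leverage
    score, project it out of the matrix, and repeat."""
    X = X.astype(float).copy()
    selected = []
    for _ in range(n_select):
        _, _, vt = np.linalg.svd(X, full_matrices=False)
        scores = np.sum(vt[:k] ** 2, axis=0)  # leverage scores from top-k singular vectors
        scores[selected] = -np.inf            # never pick a column twice
        c = int(np.argmax(scores))
        selected.append(c)
        col = X[:, [c]]
        # orthogonalize the remaining columns with respect to the selected one
        X -= col @ (col.T @ X) / (col.T @ col + 1e-12)
    return selected

X = np.random.default_rng(0).normal(size=(20, 10))
print(fps_columns(X, 4), greedy_cur_columns(X, 4))
```

Both routines return column indices, which can then be used to slice the feature matrix before fitting a model.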
Particularly for intermediate selection sizes, the improvement in accuracy with respect to a random selection can be dramatic. Both the FPS and CUR methods can be improved further by incorporating information on the properties associated with the structures, 239 as in Eq. (88). Including a supervised component by setting α < 1 in the feature selection usually leads to better-performing models, as shown in Fig. 23. Feature selection methods can be applied to any flavor of density correlation features. Imbalzano et al. 243 used a reference data set on liquid water 244 and a large set of systematically generated ACSFs. Evaluating the RMSE of the predicted energies and forces revealed that automatic selections performed by a CUR or FPS approach may achieve similar performance to features selected based on chemical intuition and heuristics, while using a selection of approximately the same size. A dramatic reduction in the number of features is also possible for the SOAP power spectrum, and a data-driven selection of the most important components has quietly become commonplace to accelerate SOAP-based ML models 23,245,246. A more systematic investigation of the effectiveness of feature selection for many commonly used atomic descriptors has recently been reported by Onat et al. 172, who analysed how accurately the original feature vector can be reconstructed from the reduced set, as well as the performance on a practical regression task.

C. Feature optimization
As discussed in Section V C, non-linear models optimize the description of their inputs by generating new features that are best correlated with the target property, or that are adapted to the structure of the dataset. For instance, taking products of 2-body features results in an effective representation that incorporates some, but not all, features of body order 3, 4, . . . In some cases it is possible to find an explicit expression for the effective representation associated with a kernel model 27,98,165, while other approaches (most notably deep neural network models) put less focus on the interpretability of the intermediate features, and act largely as data-driven 'black boxes'. Alternatively, feature optimization can be performed explicitly on the representations presented in Section IV E. Such an optimization could take the form of a choice of basis functions. In the Behler-Parrinello framework, it is customary to select a small number of atom-centered symmetry functions based on experience and heuristics 115. An optimization of the hyperparameters by gradient descent has also been proposed 247 to obtain more accurate models based on atom-centered symmetry functions.
When considering systematically-convergent implementations of density-correlation features, the optimization of the basis set is less crucial, although one may want to reduce the size of the basis for the sake of computational efficiency, as discussed in Section VIII B. That is not to say that the details of the practical implementation of the features do not change the behavior of a model built upon them. Optimizing hyperparameters such as the cutoff radius, the density smearing, or the basis set cutoff affects how naturally the features correlate with the target property, which is one of the factors determining how quickly a regression model becomes capable of performing accurate predictions 103.

[Fig. 24: Learning curves for the atomization energy of molecules in the QM9 data set 230. Four of the lines show the MAE on the test set for kernel regression models based on SOAP (|ρ_i^⊗2⟩) features with different cutoff radii (dashed lines graduating from red to blue). The other lines show the MAE on the test set for the optimal radially-scaled (RS) and multiple-kernel (MK) SOAP models (black and grey lines, respectively). In every model, the features were constructed with well-converged hyperparameters, nmax = 12 and lmax = 9. The inset shows the radial-scaling function u(r) from r = 0 Å to r = 5 Å with the parameters that were found to minimize the ten-fold cross-validation MAE on the optimization set through a grid search, r0 = 2 Å and m = 7. The multiple-kernel model combines the rcut = 2, 3, 4 and RS kernels in the ratio 100'000 : 1 : 2 : 10'000, and its learning curve agrees with the RS result to within graphical accuracy. Error bars are omitted because they are as small as the data point markers. Note that errors are expressed on a per-atom basis. Errors per molecule, expressed in kcal/mol, can be obtained approximately by multiplying the scale by 0.4147, which is computed based on the average size of a molecule in the QM9 database. Adapted from Ref. 105.]
For example, the smearing of the atom density, or the truncation of the basis set, should reflect the natural scale over which the target properties vary. Similarly, the size of the local environment determined by r_cut relates to the typical decay length of interactions, as mentioned in Section III C, but it also changes the effective dimensionality of feature space, which affects the accuracy of the model in a non-trivial way. Consider the learning curves shown in Fig. 24, which report the prediction accuracy, as a function of the training set size, for a kernel ridge regression model of molecular atomization energies based on SOAP features that differ in the value of r_cut. A very large cutoff, r_cut = 5 Å, does not yield the best performance, despite providing information on a wider range of distances. In fact, one observes the need to balance the complexity of the model against the available data: a very short-range r_cut = 2 Å yields the most effective description in the data-poor regime, but the accuracy of the corresponding model saturates due to the lack of information on non-covalent interactions. Combining multiple representations in a "multi-kernel" model (which is effectively equivalent to concatenating multiple feature vectors, each scaled separately) yields consistently better performance 21,28. The weighting of the different components, which can be optimized by cross-validation, indicates the relative importance of correlations on various length-scales. The fact that larger-r_cut features carry low weight in the optimal combination suggests that an improvement in performance can be obtained by calibrating the distance-dependent contributions of neighbors to the environment description. This can be achieved by introducing a radial scaling function (indicated as u(r) in Ref. 248, and as ξ_ν(d) in Ref. 97) that down-weights the contributions of atoms in the far field. As shown in Fig.
25 for the FCHL representation 97, the choice of the form of this scaling can change the accuracy of the model by more than a factor of 2. A similar effect is seen in Fig. 24 for the case of SOAP features. It is also worth noting that the cutoff function usually adopted in Behler-Parrinello-like frameworks decays rapidly, well before reaching r_cut, suggesting that a similar optimization is implicitly at play. 126 The optimization of a radial scaling function has become commonplace, and most recent applications based on the SOAP power spectrum rely on it to achieve consistently optimal performance in both the data-poor and data-rich regimes.
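The qualitative behavior of such a radial scaling can be illustrated with a simple functional form. The expression below, u(r) = c/(c + (r/r0)^m), is one plausible choice loosely inspired by the discussion in Ref. 248; the published form and parameters may differ, and the values r0 = 2 Å and m = 7 are only the ones quoted for the grid search of Fig. 24.

```python
import numpy as np

def radial_scaling(r, r0=2.0, m=7, c=1.0):
    """Down-weight far-field neighbor contributions. r0 sets the crossover
    distance and m the sharpness of the decay (illustrative form, not the
    exact expression of Ref. 248)."""
    return c / (c + (r / r0) ** m)

r = np.linspace(0.0, 5.0, 6)
w = radial_scaling(r)
print(w)
```

Near the central atom the weight is close to 1, and it decays smoothly to a small value well before a typical 5 Å cutoff, mimicking the behavior described in the text.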
Rather than optimizing the correlations between geometric features and the target properties, one can attempt to build features that incorporate a notion of chemical similarity between different elements. The idea was introduced in terms of an alchemical similarity kernel in Ref. 101, is also a core component of the FCHL framework 97, and has been implemented in different forms in the context of atom-centered symmetry functions 250 and of density correlation features 105. In general terms, the idea is to achieve a reduction of the dimensionality of chemical space, writing formally a (linear) projection of the elemental basis,

⟨α| = Σ_a ⟨α|a⟩ ⟨a|,  (99)

where the coefficients ⟨α|a⟩ enact the projection between the elemental and the "alchemical" basis. The reduction in the dimensionality of the feature space can be substantial: for powerspectrum (ν = 2) features, the number of components scales quadratically with the number of species, and so even just halving the dimension of the chemical space reduces the number of powerspectrum features by 75%:

⟨α₁ n₁; α₂ n₂; l|ρ_i^⊗2⟩ = Σ_{a₁ a₂} ⟨α₁|a₁⟩ ⟨α₂|a₂⟩ ⟨a₁ n₁; a₂ n₂; l|ρ_i^⊗2⟩.  (100)

[Fig. 26: Learning curves for a model of the cohesive energy of a database of elpasolite structures, each containing a random selection of four elements chosen among 39 main group elements. 249 The standard SOAP curve is shown in black, the best curve from Ref. 249 is shown in bright red (REF), and the curves obtained with an alchemical model with reduced dimensionality dJ are shown in dark red (dJ = 1), purple (dJ = 2) and blue (dJ = 4). The multiple-kernel model (shown in grey) combines three standard SOAP kernels with different cutoffs and one alchemically optimized kernel with dJ = 4. Adapted from Ref. 105.]

Figure 26 demonstrates how reducing the dimensionality of chemical space helps to achieve a transferable, accurate model with a small number of training structures.
By comparing learning curves with different degrees of compression, one sees a data/complexity interplay similar to the one observed for radial correlations. A low-dimensional alchemical space is beneficial in the data-poor regime, as it allows the model to make an educated guess about the interactions of pairs of elements that are not represented in the training set. Learning curves with low chemical complexity, however, saturate in the limit of large training sets, because the compressed features are not sufficiently flexible, and cannot describe the differences between elements.
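The bookkeeping behind Eq. (100) can be sketched with random density coefficients. The numbers below (4 species compressed to 2 alchemical channels, hypothetical basis sizes) only serve to show that the projection acts on the species index alone, and that halving the chemical dimension cuts the ν = 2 feature count by 75%.

```python
import numpy as np

rng = np.random.default_rng(0)
n_species, n_alch = 4, 2        # halve the chemical dimension
n_max, l_max = 6, 4

# hypothetical density coefficients <a n l m | rho_i>, indexed [a, n, l, m]
rho = rng.normal(size=(n_species, n_max, l_max + 1, 2 * l_max + 1))

# projection <alpha|a> from elemental to "alchemical" channels
u = rng.normal(size=(n_alch, n_species))

# project the species index only; the geometric (n, l, m) structure is untouched
rho_alch = np.einsum("xa,anlm->xnlm", u, rho)

# nu = 2 power-spectrum-like features scale with the square of the number of
# chemical channels, so halving it removes 75% of the components
n_full = n_species**2 * n_max**2 * (l_max + 1)
n_compressed = n_alch**2 * n_max**2 * (l_max + 1)
print(n_compressed / n_full)
```

In an actual model the coefficients ⟨α|a⟩ would be optimized, e.g. by cross-validation, rather than drawn at random.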
The optimization of both the geometrical and compositional components of the density-correlation features can be construed as a linear transformation of the kets (or, when seen in terms of the linear kernels built on such features, as the action of a Hermitian operator 248). The requirement that such transformations do not affect the symmetry properties of the features restricts the form they can take: for instance, they cannot mix channels with different l or m. These observations imply that (1) linear feature optimizations do not change the nature of the representations, and can be applied equally well to any implementation of |ρ_i^⊗ν⟩ features, and (2) as long as the linear transformation is full rank, there is no loss of information, which means that the observed change in performance is linked to the details of the regression scheme, such as the regularization in linear or kernel models.
As a final remark, let us mention that a critical analysis of a feature-optimization effort often reveals insights into the physical-chemical properties of the system being studied and the target properties. For instance, comparing models of the energy that use different r_cut can be used to infer relationships between length and energy scales 28, and inspection of the chemical mapping coefficients in Eq. (99) can be used to construct a data-driven periodic table of the elements (see Fig. 27). Interpretable, physics-inspired features can also provide intuitive chemical insights through the construction of knockout models 253, in which, for instance, correlations are restricted to 2-body terms, or the cutoff is reduced to first or second neighbors. The impact of these artificial restrictions on the information content of the features, and therefore on the asymptotic performance of the model, indicates how important 3- and higher-body-order interactions are, or how relevant long-range effects are in determining the value of the target property. 199

D. Efficient implementation
Despite encoding similar information content, differences in the formulation of competing structural representations may lead to large variations in implementation and performance. A first fundamental divide is between the evaluation of features by summing over clusters of ν neighbors, and by computing ν-fold tensor products of atomic densities (see Section IV F). Consider the case of evaluating "SOAP-like" ACSF by a cluster sum (cost O(n_max² l_max n_neigh²)) and by a density expansion (cost O(n_neigh n_max l_max²) for the density, and O(n_max² l_max²) for the SOAP evaluation). Despite the adverse scaling of ACSF computed as a sum over clusters of neighbors, these representations can be implemented efficiently 77,235,254 by relying on a careful selection of the features (discussed in Section VIII B), reuse of parts of the computations, parallelism and GPU acceleration. 255-257 In fact, when computing a linear model that is explicitly equivalent to a (ν+1)-body-order potential, the low-order terms can be evaluated more efficiently as a sum over neighbors. 165,258 In line with the general focus of this review, we concentrate in particular on the efficient implementation of atom-density representations. As we shall see, roughly the same considerations apply both to representations built on a smooth atom density, 90,105,123 which generalize the construction of the SOAP powerspectrum and bispectrum, 27 and to those built on a δ-like density, such as ACE 106,124 and MTP. 111,259 Indeed, both families of representations rely on three steps: (i) expansion of the local atom density on a suitable basis, e.g. Eq. (22); (ii) computation of ν-fold tensor products of the expansion; and (iii) contraction over the correlations to obtain equivariant features. While these three steps have been implemented in different ways, their efficient implementation relies on similar considerations.
Atomic density expansion

Equation (18) provides the blueprint for a broad class of (ν+1)-body atom-density representations. Practical implementations differ in the type of localized function used to construct the local atom density (see Eq. (14)), and in the radial and angular bases used for its expansion. As discussed in Section IV E, spherical harmonics are a natural angular basis, but other choices are possible. For instance, the MTP representation projects the atomic density onto a tensor product of direction vectors, leading to the covariant moment tensor 111,212

M_n^⊗ν(ρ_i) = Σ_{j∈A_i} P_{n,ν}(r_ji) r̂_ji ⊗ r̂_ji ⊗ · · · ⊗ r̂_ji  (ν times),  (101)

where P_{n,ν} is a radial function. Invariant components can be obtained by combining and contracting products of the elements of these tensors. The tensor product basis is directly related to spherical harmonics, as shown in appendix B.2 of Ref. 107. The performance of the MTP representation, 235 which relies on an efficient recursive evaluation of the basis functions 111, is a testament to the effectiveness of this basis choice. As shown in Section IV E, the choice of an angular basis of spherical harmonics greatly simplifies the evaluation of Eq. (18), which can be written in terms of contractions of density coefficients ⟨nlm|ρ_i⟩ (cf. Eq. (22)). If the environment-centered density is written in terms of a sum of density functions g(x − r_ji) ≡ ⟨x − r_ji|g⟩, peaked at the neighbor positions, the expansion coefficients can be written as the accumulation

⟨nlm|ρ_i⟩ = Σ_{j∈i} f_cut(r_ji) ⟨nlm; r_ji|g⟩  (102)

of terms that correspond to an expansion, over a basis of radial functions ⟨x|nl⟩ and spherical harmonics ⟨x̂|lm⟩, of the contributions coming from Gaussians centered on each neighbor,

⟨nlm; r|g⟩ = ∫ dx ⟨nl|x⟩ ⟨lm|x̂⟩ ⟨x − r|g⟩.  (103)

In the g → δ limit, the contribution from the j-th neighbor amounts simply to a product of the radial and angular functions evaluated at r_ji, ⟨nlm; r_ji|δ⟩ = ⟨nl|r_ji⟩ ⟨lm|r̂_ji⟩.
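The g → δ limit of the expansion can be reproduced in a few dozen lines of numpy. The sketch below uses hand-coded real spherical harmonics up to l = 2 and a simple Gaussian radial basis; both choices are illustrative, and do not correspond to the basis of any specific package. It also verifies numerically that the power-spectrum contraction p_nn'l = Σ_m c_nlm c_n'lm is invariant under a random rotation of the environment.

```python
import numpy as np

def real_sph_harm(l, m, v):
    """Real spherical harmonics up to l = 2, for a unit vector v = (x, y, z)."""
    x, y, z = v
    tab = {
        (0, 0): 0.28209479177387814,
        (1, -1): 0.4886025119029199 * y,
        (1, 0): 0.4886025119029199 * z,
        (1, 1): 0.4886025119029199 * x,
        (2, -2): 1.0925484305920792 * x * y,
        (2, -1): 1.0925484305920792 * y * z,
        (2, 0): 0.31539156525252005 * (3 * z * z - 1.0),
        (2, 1): 1.0925484305920792 * x * z,
        (2, 2): 0.5462742152960396 * (x * x - y * y),
    }
    return tab[(l, m)]

def density_coeffs(neighbors, n_max=4, l_max=2, sigma=0.5, r_cut=4.0):
    """delta-limit expansion coefficients <nlm|rho_i>, zero-padded over m."""
    centers = np.linspace(0.5, r_cut, n_max)      # Gaussian radial basis centers
    c = np.zeros((n_max, l_max + 1, 2 * l_max + 1))
    for rij in neighbors:
        r = np.linalg.norm(rij)
        if r >= r_cut:
            continue
        fc = 0.5 * (1 + np.cos(np.pi * r / r_cut))  # smooth cutoff function
        radial = np.exp(-((r - centers) ** 2) / (2 * sigma**2))
        for l in range(l_max + 1):
            for m in range(-l, l + 1):
                c[:, l, m + l] += fc * radial * real_sph_harm(l, m, rij / r)
    return c

def power_spectrum(c):
    """Rotationally invariant contraction p_{n n' l} = sum_m c_{nlm} c_{n'lm}."""
    return np.einsum("nlm,qlm->nql", c, c)

rng = np.random.default_rng(42)
env = rng.normal(scale=1.5, size=(8, 3))          # toy neighbor positions
rot, _ = np.linalg.qr(rng.normal(size=(3, 3)))    # random orthogonal matrix
p1 = power_spectrum(density_coeffs(env))
p2 = power_spectrum(density_coeffs(env @ rot.T))
print(np.allclose(p1, p2))
```

Because the coefficients transform under rotation by an orthogonal mixing of the m channels within each l, the contraction over m cancels the rotation exactly, up to floating-point error.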
Each of these terms can be computed very efficiently, exploiting in particular the fact that all orders of the spherical harmonics and their derivatives can be computed using recursion relations. 106,107,260 It might appear that using a smooth atom density g substantially complicates the evaluation of Eq. (103). However, when g is a spherical Gaussian with standard deviation σ, the angular integral can be computed analytically, 261

∫ dx̂ ⟨lm|x̂⟩ ⟨x x̂ − r_ji|g⟩ = ⟨l; x; r_ji|g⟩ ⟨lm|r̂_ji⟩,

so that the angular dependence factorizes. The remaining radial integral
can be computed numerically for any form of the radial basis, resulting in n_max l_max n_grid evaluations of special functions. For instance, the original implementation of the SOAP representation uses a numerically orthogonalized, equispaced Gaussian basis. 27 Alternatively, this integral can be performed analytically by using Gaussian type orbitals (GTO) as the radial basis 133,262. This choice makes it possible to compute the coefficients of the smeared density as easily as in the g → δ case,

⟨nlm; r_ji|g; GTO⟩ = ⟨lm|r̂_ji⟩ ⟨nl; r_ji|g; GTO⟩,

where the only overhead comes from having to compute O(n_max l_max) terms for the radial part and its orthonormalization. This overhead is asymptotically cheaper than combining the radial and angular terms, which requires O(n_max l_max²) multiplications per neighbor, but it can be substantial in practical cases, because the analytical integrals in Eqs. (105) and (108) yield nonstandard special functions.
To reduce this overhead, one can choose a form of the atomic density that is symmetric about the central atom position r_i rather than about the neighbor position r_ji, with the dependence on the neighbor direction entering only through a term of the form (1 − r̂_ji · x̂). (110) Together with a choice of radial functions that do not depend explicitly on l, this allows factorizing the radial integral (108) as

∫ dx ⟨n|x⟩ ⟨x − r_ji; l|g⟩ = ⟨n; r_ji|g⟩ ⟨l; r_ji|g⟩.  (111)
Coupled with the polynomial basis proposed in Ref. 27, these expansion coefficients can be computed efficiently using recurrence relations for the radial and angular coefficients. More generally, the cost of evaluating the radial integrals ⟨nl; r|g⟩ can be made negligible by using splines to approximate the values of the special functions resulting from the integrals, or the numerical integration of basis functions for which no analytical expression exists. Another aspect that does not affect the asymptotic scaling of the expansion, but can significantly influence the prefactor, involves the evaluation of the spherical harmonics. 124,260 Several well-established techniques can be used to speed up the calculation of Y_l^m, including the use of real-valued spherical harmonics, the use of recurrence relations, and the use of formulations that are written entirely in terms of the Cartesian components of r̂_ji.
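The spline trick amounts to tabulating the expensive function once on a grid and evaluating a cheap interpolant at every neighbor distance thereafter. In this sketch a modified Bessel function stands in for the "nonstandard special functions" mentioned above; the grid density is a hypothetical choice.

```python
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.special import iv  # modified Bessel function, a stand-in for the
                              # special functions arising from the radial integrals

# tabulate the expensive function once on a grid covering [0, r_cut] ...
r_grid = np.linspace(1e-3, 5.0, 200)
table = CubicSpline(r_grid, iv(1.5, r_grid))

# ... then evaluate the cheap spline at arbitrary neighbor distances
r = np.linspace(0.1, 4.9, 1000)
err = np.max(np.abs(table(r) - iv(1.5, r)))
print(err)
```

With a cubic spline the interpolation error decays as the fourth power of the grid spacing, so a few hundred grid points typically suffice for accuracy far beyond what the regression model requires.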
Symmetrized n-body correlations

The density coefficients ⟨anlm|ρ_i⟩ are then combined to compute invariant (or covariant) features. Formally, the evaluation of the symmetry-adapted features, both those built using only local |ρ_i⟩ features and the multi-scale features that combine |ρ_i⟩ and |V_i⟩, involves a tensor product of ν sets of density coefficients, yielding density correlations in the uncoupled basis ⟨(a_α n_α l_α m_α)_{α=1...ν}|, followed by a contraction along the m_α indices, which generates the equivariant features expressed in the coupled basis ⟨(a_α n_α l_α k_α)_{α=1...ν}|. A technical difficulty one has to keep in mind when implementing the calculation of equivariant features is that the angular (l, m) indices have an irregular memory layout, with −l ≤ m ≤ l. Depending on the hardware architecture, it might be beneficial to store the coefficients in a regular (l_max + 1) × (2l_max + 1) array, padded with zeros.
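The zero-padded layout can be illustrated directly; the alignment convention below (m stored at column m + l, padding at the end of each row) is one of several possible choices.

```python
import numpy as np

l_max = 3
# ragged storage: one list of 2l + 1 coefficients per angular channel l
ragged = [np.arange(2 * l + 1, dtype=float) for l in range(l_max + 1)]

# regular zero-padded layout: row l holds m = -l..l at columns m + l,
# with the unused trailing columns left at zero
padded = np.zeros((l_max + 1, 2 * l_max + 1))
for l, block in enumerate(ragged):
    padded[l, : 2 * l + 1] = block

print(padded.shape)
```

The regular array trades a little memory for unit-stride access and straightforward vectorization of the subsequent contractions.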
A more substantial challenge associated with the increase of the body order is that both the number of linearly independent features, and the cost of evaluating each of them by a naive contraction of the tensor products of density coefficients, grow exponentially with ν. The cost of the contraction can be reduced by recursive evaluation (e.g. based on Eq. (40)), but to beat the exponential scaling completely, these recursive expressions should be combined with feature selection schemes such as those discussed in Section VIII B. For example, the N-body iterative contraction of equivariants (NICE) framework incorporates a selection/contraction step at each level of the iteration. For each equivariant component ⟨q|ρ_i^⊗ν; σ; λμ⟩, one determines (e.g. by principal component analysis, or just by dropping some components) a set of coefficients U^{ν;σλ}_{q̃q} that can be used to reduce the dimensionality of the features,

⟨q̃_{ν;σλ}|ρ_i^⊗ν; σ; λμ⟩ = Σ_q U^{ν;σλ}_{q̃q} ⟨q|ρ_i^⊗ν; σ; λμ⟩.  (112)

Given that this operation only mixes features with the same equivariant behavior, it is then possible to perform an iteration equivalent to Eq. (40) to increase the body order further,

⟨q|ρ_i^⊗(ν+1); σ; λμ⟩ ≡ ⟨q̃_{ν;τk}; nlk|ρ_i^⊗(ν+1); σ; λμ⟩.  (113)

Note that in the first line we use the loose definition of the indices in the bra-ket notation (Section IV A): the ν + 1 term can be indexed explicitly, with a notation that recalls the lower-order terms that are combined to obtain it; once it is computed, the granularity of the indexing becomes irrelevant, and a flat index can be used to streamline the notation. With this combination of expansion and contraction, only the components that contribute significantly to the description of the structural diversity of the dataset, or to the prediction of the target properties, are retained to evaluate higher-order correlations.
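The selection/contraction step of Eq. (112) can be sketched as a principal component analysis of one block of features sharing the same (σ, λ); the feature matrix below is synthetic, with an artificially low-rank structure so that the truncation is nearly lossless.

```python
import numpy as np

rng = np.random.default_rng(1)
# hypothetical block of equivariant features sharing the same (sigma, lambda):
# 200 environments x 50 channels, built to be approximately rank 10
X = rng.normal(size=(200, 10)) @ rng.normal(size=(10, 50))
X += 0.01 * rng.normal(size=X.shape)

# principal components of the block play the role of U^{nu; sigma lambda}
cov = X.T @ X / len(X)
eigval, eigvec = np.linalg.eigh(cov)
U = eigvec[:, ::-1][:, :10]      # keep the 10 most important combinations

X_contracted = X @ U             # reduced features entering the next iteration

# fraction of the variance retained by the truncation
kept = eigval[::-1][:10].sum() / eigval.sum()
print(X_contracted.shape, kept)
```

In the NICE iteration the contracted block would then be combined with the density coefficients to raise the body order, with the dimensionality held fixed at each level.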
An alternative perspective for developing efficient implementations is to represent invariant or equivariant properties F(A_i) in terms of the unsymmetrized correlations, with the desired symmetries imposed through constraints on the coefficients ⟨F|n₁l₁m₁ · · · n_ν l_ν m_ν⟩. While this perspective imposes additional complexity on the regression scheme, it is convenient for the fast evaluation of a fitted model (with the coefficients now ensuring the correct symmetries), since the coupling coefficients need not be stored or evaluated anymore. An efficient evaluation then requires a recursion for the unsymmetrized correlations

⟨n₁l₁m₁; . . . ; n_ν l_ν m_ν|ρ_i^⊗ν⟩ = Π_{α=1}^{ν} ⟨n_α l_α m_α|ρ_i⟩,

which is relatively straightforward to construct, 107 the key challenge being to retain only the (n_α, l_α, m_α)_α features that give rise to non-zero coefficients.

E. Packages to evaluate atom-density representations
To provide a practical example of the use of software to compute representations, we compare three packages, namely quippy 263, Dscribe 262 and librascal 264, which are open source and can easily be used from Python. We do not discuss the internals of the implementations, but show code snippets that can be readily used to evaluate descriptors of atomic structures, focusing primarily on the SOAP powerspectrum. All examples use the Atomic Simulation Environment 265, and atomic structures are assumed to be stored in the variable structures, containing instances of ASE's Atoms class. In all of these implementations, the descriptor vectors are returned as numpy.array objects, from which kernel values may be obtained by computing the dot products between descriptor vectors.
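Regardless of which package produces the features, assembling a linear (dot-product) kernel from the returned arrays is plain numpy; the array shapes below are illustrative stand-ins for the descriptor matrices of two structures.

```python
import numpy as np

rng = np.random.default_rng(0)
X_a = rng.normal(size=(12, 64))  # descriptor vectors for the environments of structure A
X_b = rng.normal(size=(9, 64))   # descriptor vectors for the environments of structure B

# environment-environment kernel matrix: k(i, j) = phi(A_i) . phi(B_j)
K_env = X_a @ X_b.T

# a simple structure-structure kernel: the average over environment pairs,
# which equals the dot product of the mean feature vectors
K_struct = K_env.mean()
assert np.isclose(K_struct, X_a.mean(axis=0) @ X_b.mean(axis=0))
```

This averaged kernel corresponds to the additive baseline discussed elsewhere in this review; non-additive combinations are illustrated in Section IX A.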
The quippy python package is based on the QUIP suite with the GAP extension, which provides the descriptors module. QUIP must be downloaded and built using a Fortran compiler before quippy, which uses f90wrap to access the compiled functions in QUIP via python interfaces. In the GAP implementation, Gaussian radial basis functions are used, placed at equal intervals and orthogonalized.

```python
from quippy.descriptors import Descriptor

soap = Descriptor("soap cutoff=3.5 cutoff_transition_width=0.0 atom_sigma=0.3 n_max=8 l_max=6")
# returns a dictionary containing the features and connectivity information
features = soap.calc(structures)
# features["data"] is a numpy array with the shape (n_environments, n_features)
```

The Descriptor object is initialised using a string containing the kernel parameters in a key=value format, with some keys being mandatory. The dscribe package 262 implements multiple descriptors, including SOAP, MBTR 29 and ACSF. A python interface is used to interact with calculator functions written in C/C++, ensuring efficient evaluation. The main difference between the SOAP implementations of quippy and dscribe is the choice of radial basis functions, which in dscribe are spherical primitive Gaussian Type Orbitals (GTOs), orthogonalised using the method suggested by Löwdin 266. Alternatively, cubic or higher-order polynomials may also be chosen. In analogy with the definition of GTOs used in quantum chemistry, the radial basis has an explicit dependence on l. The python object providing the descriptor is constructed from the class SOAP, specifying the parameters as initialisation arguments.
The package librascal also provides a variety of descriptors, but chiefly focuses on the calculation of density-based representations, including SOAP and the ν = 1 and ν = 3 correlations. The back-end, written in C++, can be accessed from python interfaces. Exploiting the spirit of the general construction of |ρ_i^⊗ν⟩ features, librascal implements two kinds of radial functions, namely a family of GTO-like radial functions 131 as well as a discrete variable representation (DVR) basis, corresponding to a real-space evaluation of the symmetrized density using a Gauss-Legendre quadrature rule. The SphericalInvariants object uses the transform method to compute SOAP features, which are stored internally in a sparse format in which each dense block corresponds to a pair (a, a′) of elemental densities. These features can be used to compute scalar-product kernels between two environments, or cast to a dense array through the get_features method.

[Fig. 29: Convergence of SOAP features computed for 10'000 random CH4 configurations 237 using different codes and radial bases, measured in terms of the error incurred when linearly predicting fully-converged features Φ_full (nmax = 24, lmax = 12, computed with the GTO implementation in librascal) using features with lmax = 8 and growing values of nmax (left), and with nmax = 14 (16 for librascal) and growing values of lmax (right). Top panels show the linear reconstruction error (GFRE), which measures the amount of information that cannot be linearly decoded from the coarser features. Bottom panels show the reconstruction distortion (GFRD), which measures the additional error one makes when limiting the reconstruction to an orthogonal transformation. 236]
These three packages all compute "SOAP" features, but differ in the choice of basis functions. Much as with electronic structure codes, which often yield results that differ significantly despite performing nominally the same type of calculation, 267 one cannot expect to be able to combine the features computed by one package with the regression weights computed by another. It is, however, important to assess whether the features are equivalent in a less stringent sense, e.g. whether they contain analogous information, and whether they converge to the same limit as the expansion parameters (n_max, l_max) are increased. Figure 29 demonstrates the convergence of the GFRE and GFRD (see Section VIII A and Ref. 236) between small-(n_max, l_max) features and a highly converged Φ_full featurization. In all the cases we consider, GFRE(Φ_full, Φ) is at least one order of magnitude smaller than GFRE(Φ, Φ_full). One sees that, reassuringly, in all cases the feature reconstruction errors converge towards zero. For n_max = 16, all choices of radial bases are essentially converged, and the residual error is due to the convergence of the angular channels. Since all implementations use equivalent spherical harmonics expansions, the convergence with the angular cutoff l_max is nearly identical. The convergence rate of the radial bases, however, is not the same. The GTO bases in librascal and dscribe carry similar amounts of information, and converge faster than the bases used in quippy and the librascal DVR implementation. The GFRD also converges to zero for most implementations, meaning that in the complete basis set limit the corresponding features become equivalent. The implementation in dscribe is an exception, with a GFRD saturating at approximately 0.1, suggesting that implementation details lead to persistent differences in the weighting of different kinds of correlations, even when (n_max, l_max) increase beyond the values that are typically used in practice.

IX. APPLICATIONS
In this Section we want to report some representative applications that highlight different aspects of the representations discussed in this Review -demonstrating how an understanding of the nature and properties of the structure/features mapping can be used to construct efficient and insightful machine-learning models.

A. Best match kernels for ligand binding
Contrary to the problem of predicting interatomic potentials, or other extensive properties, the affinity between a protein and a small drug-like molecule does not fit well into the mold of an additive property model. The structure of the ligand must allow the active portion of the molecule to fit in the binding pocket of the target protein, and the nature of the chemical groups in this "warhead" portion is more important in determining the strength of the interaction than the peripheral portions of the molecule. Figure 30 shows the accuracy of a classifier based on SOAP features, which aims to distinguish active compounds from decoys for a given target protein. The targets and the ligands, as well as their "ground truth" binding behavior, are taken from the database of useful decoys, enhanced (DUD-E) 43. The performance of the classifier is represented in terms of receiver operating characteristic (ROC) curves (the ROC curve of a perfect classifier would run along the left and top margins of the plot, while that of a classifier that is as good as random would run along the diagonal), and their area under the curve (AUC) (the AUC is the integral of the ROC, and roughly corresponds to the fraction of molecules that are classified correctly). The AUC plot in the inset of Fig. 30 shows that a model based on an average metric, which describes each molecule as the average of its environments, Eq. (85), performs rather poorly, which is unsurprising given the highly non-additive nature of the binding affinity. Using a "best-match" kernel (equivalent to the distance in Eq. (86), and implemented in practice as the small-γ limit of the REMatch kernel 101) improves dramatically the accuracy of the classifier, bringing the AUC to well above 0.95. A judicious choice of the training structures, based on farthest point sampling, further accelerates the convergence of the classifier with train set size.
This application provides an example of how local representations can be combined in a nonadditive way, resulting in a dramatic improvement of the machine-learning performance for a problem in which non-additive behavior is to be expected.

B. Tensorial features and polarizability
Some of the early examples of machine-learning models leveraging covariant features focused on the prediction of dielectric response functions, such as the dipole moment µ (and the equivalent bulk quantity, polarization), the polarizability α (and the closely-related electronic dielectric constant), as well as higher-order terms, such as the first hyperpolarizability β. We discuss the case of the static dipole polarizability α as a representative case that highlights many of the current ideas and applications. In its Cartesian form, α is a symmetric tensor, fully determined by six components (α_xx, α_yy, α_zz, α_xy, α_xz, α_yz). In order to build a machine-learning model based on equivariant density correlation features, it is more convenient to apply a unitary transformation that casts it into its irreducible spherical components (ISCs). The spherically symmetric term, α(0), corresponds to the trace of the tensor, while the five anisotropic components, α(2)_µ with µ = −2, ..., +2, transform collectively as λ = 2 spherical harmonics, and can be computed using recursive relationships that are explicitly reported in Ref. 132. A clear advantage of this construction is that, unlike the components of the Cartesian tensor, the two ISCs of α can be independently represented by the equivariant density-based features corresponding to λ = 0 and λ = 2, relying on a linear prediction model similar to the one reported in Eq. (37). 23,131,133 The inherent locality of the model means that the tensor prediction can be broken down into a sum of individual atomic contributions, α = Σ_i α(A_i). These local components can be combined to make predictions for larger and more complex molecules than those included in the training set.
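The separation into ISCs is easy to sketch numerically. The snippet below is a minimal illustration (the precise normalization of the λ = 2 components varies between references, and the recursive construction of Ref. 132 may use a different convention): it splits a symmetric 3×3 tensor into its trace part and the five independent components of the symmetric traceless remainder.

```python
import numpy as np

def cartesian_to_iscs(alpha):
    """Split a symmetric 3x3 tensor into a lambda=0 part (the
    isotropic average, a single number) and a lambda=2 part (the
    five independent entries of the symmetric traceless remainder).
    The normalization here is an illustrative choice."""
    alpha = np.asarray(alpha, dtype=float)
    iso = np.trace(alpha) / 3.0              # lambda = 0 channel
    aniso = alpha - iso * np.eye(3)          # symmetric, traceless
    l2 = np.array([aniso[0, 0] - aniso[1, 1],  # xx - yy
                   aniso[2, 2],                # zz (trace fixes xx+yy)
                   aniso[0, 1], aniso[0, 2], aniso[1, 2]])
    return iso, l2
```

An isotropic tensor maps entirely onto the λ = 0 channel, so a model for it needs only invariant features, while the λ = 2 components require genuinely covariant ones.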
This transferability was exploited in the AlphaML model 23,269 to fit against coupled-cluster (CCSD) reference values, computed for small organic molecules from the QM7b dataset 222,268 , and to predict the polarizability of 52 larger "showcase" molecules that are at the limit of what is computable with state-of-the-art quantum chemistry methods. On these molecules, the error of AlphaML against the CCSD reference (0.24 a.u./atom) was less than half the discrepancy between CCSD and DFT (0.57 a.u./atom).
An additive model also provides predictions for the local contributions to α, which are represented in Fig. 31 in terms of ellipsoids aligned along the principal axes of α(A_i). Even though these components need not be physically meaningful (given that the only training target is the total polarizability), the local α(A_i) reflect some chemical insight: e.g., the model predicts large components when centering the representation on the highly-polarizable sulfur atoms, as well as along the directions in which the molecules are highly polarizable. Highly conjugated molecules are also interesting because they exhibit a non-additive behavior of the polarizability, due to the vanishing HOMO-LUMO gap. Because of the spatial nearsightedness of the representation, the model breaks down when asked to predict the polarizability of large polyenes and polyacenes based on the information learned on simpler and smaller molecular units. This is well represented in Fig. 32, where the prediction of α is tested for conjugated carbon-based molecules of increasing size, including fullerene 23 . The prediction of α using equivariant features can also be extended to the condensed phase, where it provides a crucial ingredient to compute Raman spectra. An example is reported in Ref. 134, where the polarizability of crystal polymorphs of paracetamol is predicted along a full molecular dynamics trajectory, thus allowing for the calculation of the Raman intensity in terms of the polarizability correlation spectrum. As shown in Fig. 33, given the local nature of the polarizability response in these systems, accurate Raman intensities and lineshapes can be predicted over the entire range of frequencies.
The low cost associated with computing dielectric response functions by ML models using symmetry-adapted features makes it possible to routinely evaluate condensed-phase infrared and Raman spectra, including also a description of the quantum mechanical nature of the nuclei 271 , a task that until very recently required enormous computational effort 272 .

C. Long-range and non-local responses
The clear breakdown of an ML model based on local features that is apparent in Fig. 32 is representative of a general limitation of density-based features. There are essentially two approaches one can take to tackle the issue of the non-locality of structure-property relations, both of which are illustrated in Fig. 34. One approach is to learn a proxy of the target property that has a more localized nature, and that can then be easily manipulated to obtain the end result. The top panel of Fig. 34, adapted from Ref. 133, is an example of this approach. The electronic dielectric response ε∞ of bulk water is affected by a collective, macroscopic electrostatic effect that is captured, in the continuum limit, by well-known expressions such as the Clausius-Mossotti relation, α = V(ε − 1)/(ε + 2), which links ε∞ to an effective molecular polarizability. This effective α is more readily learnable by a local model, leading to better accuracy and transferability in predicting ε∞.
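The "learn a local proxy" strategy amounts to a simple data transformation around the ML model. A sketch of that transformation only (not of the actual model of Ref. 133; the function names and the per-molecule volume argument are our own): convert each training target ε∞ into an effective polarizability via the Clausius-Mossotti relation quoted above, learn that quantity, and invert the relation at prediction time.

```python
def alpha_from_epsilon(eps_inf, volume):
    """Clausius-Mossotti map: effective molecular polarizability
    from the electronic dielectric constant,
    alpha = V (eps - 1) / (eps + 2), with `volume` the volume
    per molecule."""
    return volume * (eps_inf - 1.0) / (eps_inf + 2.0)

def epsilon_from_alpha(alpha, volume):
    """Inverse map, applied to the ML prediction of the effective
    alpha to recover the dielectric constant."""
    x = alpha / volume
    return (1.0 + 2.0 * x) / (1.0 - x)
```

The two maps are exact inverses of each other, so any accuracy gained by learning the more local α carries over directly to the predicted ε∞.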
A different approach is needed when there is no obvious transformation of the target property into a more local version, as is the case for the polarizability of conjugated hydrocarbons. In these cases, one needs a model that is able to describe arbitrary non-local correlations. Long-range representations such as multiscale LODE features (Eq. (46)) are particularly attractive, in that they combine a long-range character (coming from the potential field) with an additive decomposition that provides the transferability needed to extend the prediction to systems of increasing size. This is demonstrated in Fig. 34, where the multiscale LODE model is tested for predicting the isotropic component of the polarizability of a series of polypeptides of increasing length. While the predictions for short peptides share a similar accuracy to that obtained using a pure density-based representation, the inclusion of the potential field greatly decreases the prediction error for longer molecular chains. A large, overall improvement of the prediction accuracy is observed when adopting an optimized, weighted combination of local and LODE features. These results suggest that the inclusion of long-range features within the regression model provides a better description of the intermediate-range interactions, and that by adjusting the relative importance of local and delocalized terms the model can be trained only on small molecules and extrapolate reliably across systems of increasing size. Very similar findings were reported on the transferability of models of molecular dipoles 140 , where however the splitting between local and long-range physics was achieved by combining different regression models rather than by different choices of features.
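At the level of kernels, the weighted combination of local and long-range features can be sketched in a few lines of Python (an illustration only: the mixing weight, jitter, and the plain kernel ridge regression below are our own simplified stand-ins for the actual regression scheme of the references):

```python
import numpy as np

def combined_kernel(K_local, K_lode, w):
    """Convex combination of a short-range (density-based) kernel
    and a long-range (LODE-style) kernel; w in [0, 1] sets their
    relative importance and can be tuned by cross-validation on
    small molecules."""
    return (1.0 - w) * K_local + w * K_lode

def krr_fit(K_train, y, jitter=1e-8):
    """Kernel ridge regression weights on the training kernel."""
    return np.linalg.solve(K_train + jitter * np.eye(len(K_train)), y)

def krr_predict(K_test_train, weights):
    """Predict targets for new structures from the rectangular
    test/train kernel block."""
    return K_test_train @ weights
```

The same mixing weight w then enters both the training and the prediction kernels, so the local/long-range balance learned on small systems is what gets extrapolated to larger ones.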

D. Electronic charge densities
Another relevant scenario in which the data-driven prediction of a quantum property benefits from a representation that relies on local and equivariant features is the electron density n_e(r) of an atomic structure. The density is a scalar field, and has been modeled with some success by predicting its value at a specific point using an invariant representation centered on that point 273,274 . Given that (particularly in the case of an all-electron calculation) atomic nuclei are a natural vantage point from which to decompose the overall electron density, one may want instead to model n_e(r) as a sum of atom-centered contributions, n_e(r) = Σ_i n_e(A_i; r). These atom-centered terms can then be conveniently decomposed in terms of local functions, at the price of adopting a multicentered, non-orthogonal basis for the expansion, i.e.

n_e(A_i; r) = Σ_{nλµ} c_{nλµ}(A_i) R_n(r_i) Y_λ^µ(r̂_i), (114)

where r_i = |r − r_i| and r̂_i is the corresponding unit vector, R_n represent some suitably optimized radial functions (for instance those used in resolution-of-the-identity methods in quantum chemistry 275 ) and c_{nλµ}(A_i) are the non-orthogonal expansion coefficients, which depend on the arrangement of atoms in the environment A_i. These coefficients must transform covariantly with a rotation of the environment, and each λ-component can be independently predicted using equivariant features of the corresponding order, for instance with a linear model

c_{nλµ}(A_i) ≈ Σ_Q c_Q^{nλ} ⟨Q|A_i; ρ_i^{⊗ν}; σ; λµ⟩, (115)

even though current implementations use a kernel regression scheme 22,276 . The non-orthogonality of the basis used to represent n_e(r) implies that the learning phase has to be performed considering all the different density components at the same time 22,276 . While this may sound like a computational drawback of the model, it also improves the locality of the coefficients, which underlies the remarkable transferability of the model across vast conformational and chemical spaces, since the electron density can be effectively learned as a collection of local contributions.
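The role of the overlap matrix can be illustrated with a toy one-dimensional version of the expansion (the basis, grid, and coefficient values below are our own illustrative choices): fitting a "density" as a sum of overlapping Gaussians leads to normal equations that couple all the coefficients.

```python
import numpy as np

def fit_density_coefficients(S, w, jitter=1e-10):
    """Least-squares expansion on a non-orthogonal basis:
    minimizing ||n - sum_k c_k phi_k||^2 gives the normal
    equations S c = w, with S_kl = <phi_k|phi_l> the overlap
    matrix and w_k = <phi_k|n> the projections of the target.
    Because S is not diagonal, all coefficients must be
    determined together."""
    return np.linalg.solve(S + jitter * np.eye(len(S)), w)

# toy 1D example: three strongly overlapping Gaussian basis functions
x = np.linspace(-2.0, 2.0, 401)
dx = x[1] - x[0]
basis = np.stack([np.exp(-(x - m) ** 2) for m in (-0.4, 0.0, 0.4)])
S = (basis * dx) @ basis.T          # overlap integrals on the grid
c_true = np.array([1.0, -0.5, 2.0])
n = c_true @ basis                  # the "density" to be expanded
w = (basis * dx) @ n                # projections <phi_k|n>
c = fit_density_coefficients(S, w)
```

Here c recovers c_true to numerical precision; with an orthogonal basis S would be diagonal and each coefficient could be fit independently, which is exactly what non-orthogonality forbids.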
This is well exemplified in Fig. 35, where the electron density prediction for C(8) hydrocarbons 22 is tested after training the model on much smaller compounds, containing only 4 carbon atoms. This approach has since been applied to more complex systems, such as oligopeptides 276 , and to the prediction of other scalar fields, such as the on-top density 277 .

E. Structural classification and structural landscapes
As discussed in Section VII, the choice of a representation to describe atomic structures determines the "lens" through which they are interpreted, which in turn has a strong impact on the way unsupervised learning schemes, such as clustering and dimensionality reduction, bring to light recurring patterns and structure-property relations. The potential of general-purpose, atom-density correlation features for these tasks was recognized rather early. Figure 36 shows an example: the classification of snapshots taken from simulations of different phases of water, based on Steinhardt order parameters 205 , which are closely related to |ρ_i^⊗2⟩ features, and make it possible to partly differentiate between phases. In the same study it is shown how a neural network based on atom-centered symmetry functions can be trained to achieve near-perfect classification accuracy.

[Fig. 37 caption: Sketch map of the structural similarity of 15,869 distinct PBE-DFT geometry-optimised ice structures, as computed by the Euclidean distance in SOAP space. The sketch-map coordinates are obtained by minimizing the error in reproducing this similarity in terms of distances between points on a 2D projection. The density and static lattice energy of each structure are encoded by the size and colour of the respective point, and correlate strongly with the position on the map. Known ice phases are labelled in blue; 34 new candidates are labelled in black and numbered in order of increasing dressed energy relative to a generalized convex hull 279 construction. Reproduced from Ref. 280.]
An even more comprehensive mapping of the phase diagram of water, in which crystalline and amorphous phases from across the phase diagram, as well as transition pathways between them, were considered, was produced in Ref. 187, using permutation-invariant vectors 282 as global descriptors for the different configurations. Abstract structural descriptors are particularly useful when applied to datasets that contain hypothetical structures generated by a high-throughput procedure 1 . In combination with a dimensionality-reduction scheme 4 , and with a generalized convex hull construction that attempts to estimate the synthesizability of materials by jointly considering their predicted stability and their structural similarity to other potential candidates 279 , a SOAP representation has been able to rediscover all known (meta)stable ice phases, as well as to propose another 34 structures that might also be stabilized by pressure, doping, or co-crystallization 280 (see Fig. 37).
An incomplete list of applications that use general-purpose features for structural analysis and classification includes: the construction of structure-property maps for small organic molecules 283,284 , molecular materials 223,285-287 , and corrosion inhibitors 288 ; the identification and characterization of defects in solids 289-291 and self-assembled polymers 292 ; the classification of secondary-structure patterns in polypeptides 227 and of the building blocks of zeolites 199 ; the classification of different phases in multi-phase materials 293,294 ; the search for stable phases of materials 295 ; and the determination of the convergence of microsolvation studies of the hydration free energy 296 .
There are also several examples, besides those given in Section VII, where low-dimensional maps have been used as a tool to understand the structure of a dataset or the nature of a representation. In Ref. 81, a PCA map was used to understand the effect of randomizing the atom ordering on the feature space associated with a Coulomb matrix description of molecules, emphasizing the information loss associated with sorting of the elements, an alternative route to achieving permutation invariance. In Ref. 101, maps based on different kinds of SOAP kernels provided an understanding of the effect of different approaches to combining environment-level kernels, and of different definitions of an alchemical kernel between chemical elements, on the similarity between molecules as measured by the representation. In Ref. 77, maps of a dataset of water oligomers were used to compare the performance of different ML schemes for building two- and three-body models of the energy of water clusters.
The use of low-dimensional representations to visualize the structure of a dataset (showing the relationship between different kinds of training structures, identifying regions that are poorly sampled, and determining how new configurations relate to the data the ML model has been fitted to) is also gaining traction. 28,95,96,101,246,297,298 An example of such a map is given in Fig. 38, showing the diversity of the structures used to train a transferable machine-learning potential for carbon. 281 Adopting the same type of representation used for the regression model as the basis of this kind of analysis ensures that the maps describe the same feature space that underlies the fit.
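A minimal version of such a map, using plain PCA as a linear stand-in for the nonlinear sketch-map projections discussed above, takes a matrix whose rows are per-structure feature vectors (for instance, averaged environment features) and returns 2D coordinates:

```python
import numpy as np

def pca_map(X, n_components=2):
    """Project a dataset of per-structure feature vectors onto its
    leading principal components, to visualize the diversity of a
    training set in the same feature space used for regression."""
    Xc = X - X.mean(axis=0)                       # center the features
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T               # (n_structures, 2)
```

Because the projection acts on the same features that enter the regression model, clusters and sparsely populated regions of the map directly reflect how the model "sees" the training set.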

F. 3D representations for QSPR
Even though the focus of this review is on descriptors of the 3D structure of materials applied to the construction of surrogate models of quantum mechanical properties, there is also growing interest in their application to QSPR tasks. As we briefly discuss in Section III, the descriptors that have been traditionally used in cheminformatics are based on a collection of molecular properties, or on molecular graph descriptors that do not depend on the particular conformation. 299 From a conceptual point of view, their coarseness is an advantage, because it is compatible with the definition of thermodynamic properties that are not associated with a single specific configuration, such as solvation and ligand binding free energies. Nevertheless, there is growing evidence that the use of descriptors incorporating information on the 3D geometries can improve the accuracy of QSPR models, especially for difficult cases that involve very flexible molecules, 300,301 as well as for data analytics approaches for materials informatics 302 . One of the core challenges in these efforts is the determination of the conformer geometries that should be used to evaluate the 3D descriptors, an operation for which several strategies have been explored to enhance the accuracy of QSPR models. 303,304 As we briefly discuss in Section VII E, one of the most promising research directions involves combining the high fidelity of density-based representations with a well-principled construction of ensembles of features. This is still a very active subject of research, with very encouraging results having been recently demonstrated for the prediction of the solubility of small molecules 305 and the computational screening of antiviral drugs. 306

CONCLUSIONS AND OUTLOOK
The description of atomic structures in terms of mathematically sound, computationally efficient, and physically-inspired representations has largely driven the extraordinarily successful application of machine-learning schemes to atomic-scale modeling. Independently-developed representations have undergone a process of convergent evolution to fulfill a concurrent set of requirements, such as symmetry with respect to translations and rotations, smoothness, and injectivity, a clear indication of the importance of these properties for obtaining efficient machine-learning models. Over the past few years, a more systematic study of the problem of representing atomic structures has clarified the connections between most of the successful representations, and between these and well-established concepts in the statistical physics of liquids (ν-point density correlations) and of alloys (the cluster expansion), as well as with the construction of potential energy surfaces for molecules and the condensed phase. A better understanding of the mathematical properties of representations is likely to lead, in the near future, to more robust and better-performing implementations.
A formal treatment of symmetries has also enabled the development of equivariant features that are suitable to build models that automatically obey the same transformation rules as vectors and tensors, making it possible to learn efficiently properties such as dipole moments, polarizabilities and density fields. This equivariant formulation can also be used to iteratively increase the body order of a structural representation. An important open question is how best to treat these high-body-order terms: whether by linear models that explicitly include dedicated high-order features, or by non-linear models that generate (some of) them algebraically. The answer rests both on practical considerations and on the very fundamental, highly non-trivial issue of whether a representation of limited body order provides a complete (injective) description of an atomic structure.
Another open challenge is how to deal with non-additivity, and with properties that depend on long-range interactions between far-away atoms. Particularly promising is a long-distance equivariant framework, which can be formulated as a rather straightforward extension of the same density-correlations scheme that underlies local features, and can be related to a multipole expansion of interactions. It remains to be seen whether it can describe more subtle physical phenomena such as quantum delocalization, polarization and charge transfer, and how it compares with more explicitly physically-motivated "hybrid" models. A better control of the multi-scale nature of the interactions, including the use of "multi-resolution" features, is likely to be one of the focal points of feature-engineering efforts, which may lead to an incremental, but nevertheless important, increase in the accuracy of ML models of matter. The optimization of features for a specific problem may however impact their general applicability, which is one of the critical advantages of the class of abstract, generic representations we focus on in this review, and which can be seen as the point of convergence of molecular potential energy surfaces and condensed-phase potentials.
One of the most recent research directions aims at extending the reach of this framework even further, by bridging the divide between three-dimensional continuous representations and discrete fingerprints, for applications to quantitative structure-property relations. The challenge here is to reconcile the superior resolving power of 3D, atom-density-correlation representations with the fact that traditional cheminformatics tasks aim to predict "macroscopic" properties, such as solubility or toxicity, that are not associated with an individual configuration, but rather with the ensemble of conformers associated with a specific thermodynamic state point. Another traditional application of cheminformatics is the inverse design of molecules with prescribed (or optimized) properties, and the construction of generative models. While one could envisage using 3D representations for this task, a substantial hurdle is the fact that the map between structures and density-correlation features is not bijective: there are feature vectors that do not correspond to any structure, and even feature vectors that cannot be obtained as a symmetrized correlation of an arbitrary scalar field. Thus, the unconstrained search for the "optimal feature vector" might result in a set of features that does not correspond to an actual structure. Until this issue is better understood, efforts to use atom-density representations for inverse design should rely on approaches that do not require an inverse feature map.
In the quest for more accurate and efficient machine-learning models of the structure and properties of atomistic systems, physically-motivated concepts have been incorporated into the mathematical representation of atomic configurations, resulting in striking similarities with traditional modeling frameworks. When treading the fine line between data-driven and physics-based approaches, the core question is how to achieve a natural description of well-understood phenomena without giving up the flexibility to model unexpected, complex effects, and how to build features that can be optimized for a specific application while still being universally applicable.

ACKNOWLEDGMENTS
The authors would like to thank Yasushi Shibuta for providing the structures used in Fig. 14, Stefan Goedecker for providing Fig. 15, and the many colleagues and friends who discussed this review, and the ideas it summarizes, with us. FM, MC and AG acknowledge support by the National Center of Competence in Research MARVEL, funded by the Swiss National Science Foundation.

AUTHOR BIOGRAPHIES
Félix Musil studied physics at the EPFL and received his MSc in applied physics in 2015, with a thesis on the modeling of plasma in a fusion reactor. For his PhD he joined in 2016 the group of Prof. Ceriotti at the EPFL to develop and apply methods to investigate structure-property relationships in materials using atomistic modeling and machine learning techniques.
Andrea Grisafi studied chemistry at the University of Pisa and the Scuola Normale Superiore of Pisa. In 2016, he received his MSc in physical chemistry with a thesis on the statistical mechanics of simple ionic liquids. Since then, he has been a PhD student in the group of Prof. Michele Ceriotti at EPFL, where he works on the development of atomic-scale representations that are suitable to incorporate physical symmetries and long-range effects within machine-learning models of molecular and materials properties.
Albert P. Bartók is an Assistant Professor at the University of Warwick. He earned his PhD degree in physics from the University of Cambridge in 2010, his research having been on developing interatomic potentials based on ab initio data using machine learning. He was a Junior Research Fellow at Magdalene College, Cambridge and later a Leverhulme Early Career Fellow. Before taking up his current position, he was a Research Scientist at the Science and Technology Facilities Council. His research focuses on developing theoretical and computational tools to understand atomistic processes.
Christoph Ortner is Professor of Mathematics at the University of British Columbia (Canada). After obtaining his doctorate in numerical analysis in 2007 at the University of Oxford (UK) and remaining there as an RCUK fellow, he moved to the University of Warwick in 2011 and to UBC in 2020. His main interests revolve around mathematical and computational aspects of atomistic and multi-scale modeling.
Gábor Csányi is Professor of Molecular Modelling at the University of Cambridge (UK). He obtained his doctorate in computational physics (2001) from the Massachusetts Institute of Technology (USA), having worked on electronic structure problems. He was in the group of Mike Payne in the Cavendish Laboratory before joining the faculty of the Engineering Laboratory at Cambridge. He is developing algorithms and data driven numerical methods for atomic scale problems in materials science and chemistry.

Michele Ceriotti is Associate Professor at the Institute of Materials at the École Polytechnique Fédérale de Lausanne. He received his Ph.D. in Physics from ETH Zürich in 2010, under the supervision of Professor Michele Parrinello. He spent three years in Oxford as a Junior Research Fellow at Merton College, and joined EPFL in 2013, where he leads the laboratory for Computational Science and Modeling. His research interests focus on the development of methods for molecular dynamics and the simulation of complex systems at the atomistic level, as well as on their application to problems in chemistry and materials science, using machine learning both as an engine to drive more accurate and predictive simulations, and as a conceptual tool to investigate the interplay between data-driven and physics-inspired modeling.