Computational Reverse-Engineering Analysis for Scattering Experiments (CREASE) with Machine Learning Enhancement to Determine Structure of Nanoparticle Mixtures and Solutions

We present a new open-source, machine learning (ML) enhanced computational method for experimentalists to quickly analyze high-throughput small-angle scattering results from multicomponent nanoparticle mixtures and solutions at varying compositions and concentrations to obtain reconstructed 3D structures of the sample. This new method is an improvement over our original computational reverse-engineering analysis for scattering experiments (CREASE) method (ACS Materials Au 2021, 1 (2), 140−156), which takes as input the experimental scattering profiles and outputs a 3D visualization and structural characterization (e.g., real space pair-correlation functions, domain sizes, and extent of mixing in binary nanoparticle mixtures) of the nanoparticle mixtures. The new gene-based CREASE method reduces the computational running time by >95% as compared to the original CREASE and performs better in scenarios where the original CREASE method performed poorly. Furthermore, the ML model linking features of nanoparticle solutions (e.g., concentration, nanoparticles' tendency to aggregate) to a computed scattering profile is generic enough to analyze scattering profiles for nanoparticle solutions at conditions (nanoparticle chemistry and size) beyond those that were used for the ML training. Finally, we demonstrate application of this new gene-based CREASE method for analysis of small-angle X-ray scattering results from a nanoparticle solution with unknown nanoparticle aggregation and small-angle neutron scattering results from a binary nanoparticle assembly with unknown mixing/segregation among the nanoparticles.

spherical nanoparticles; however, this approach could readily be adapted for other types of systems including anisotropic nanoparticles.

I.A.2 Gene-based CREASE method workflow
The family of CREASE methods developed by Jayaraman and coworkers4,5 all rely on a genetic algorithm (GA)6 as the optimization tool to identify the optimal structure, be it a nanoparticle assembly, a micelle structure, a vesicle structure, etc. The difference between the various CREASE methods lies in the "genes" that store the information needed to construct the structure in question. Figure 1 in the main paper provides the basic workflow of any CREASE method; for a given input, the method (Step 1) starts with a population of individuals, each corresponding to a different structure; (Step 2) evaluates the "fitness" of each individual in that generation [i.e., how well the computed scattering for each individual Icomp(q) matches the target scattering Itarget(q)]; (Step 3) checks if the fitness has converged. (3A) If the fitness has not converged, the cycle continues: the next generation of individuals is produced by mutations (random changes to an individual's genes to increase population diversity) and crossovers (combinations of two individuals' genes to, ideally, produce a new, better individual), and the method returns to Step 2. (3B) If the fitness has converged, the cycle stops, and the CREASE method outputs the optimized structure whose computed scattering best matches the input experimental/target scattering. In the following sections, we describe the specifics of each step as applied to the new gene-based CREASE method presented in this paper.
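The three-step loop above can be sketched in a few lines of Python. This is a minimal illustrative skeleton, not the crease_ga package's implementation: the fitness function, the single-point crossover, the mutation rule, and the plateau-based stopping criterion are all simplified stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

def crossover(pop, pc=0.6):
    """Single-point crossover between consecutive pairs with probability pc."""
    out = pop.copy()
    for i in range(0, len(pop) - 1, 2):
        if rng.random() < pc:
            cut = rng.integers(1, pop.shape[1])
            out[i, cut:], out[i + 1, cut:] = pop[i + 1, cut:], pop[i, cut:].copy()
    return out

def mutate(pop, pm=0.05):
    """Replace each gene with a fresh random value with probability pm."""
    mask = rng.random(pop.shape) < pm
    return np.where(mask, rng.random(pop.shape), pop)

def run_crease_ga(fitness_fn, n_genes=10, n_individuals=35,
                  max_generations=50, patience=10, tol=1e-6):
    # Step 1: random initial population, genes normalized to [0, 1]
    population = rng.random((n_individuals, n_genes))
    best_hist = []
    for gen in range(max_generations):
        # Step 2: evaluate the fitness of every individual
        fitness = np.array([fitness_fn(ind) for ind in population])
        best_hist.append(fitness.max())
        # Step 3: treat a `patience`-generation plateau of the best fitness as converged
        if gen >= patience and best_hist[-1] - best_hist[-1 - patience] < tol:
            break
        # Step 3A: fitness-proportional selection, then crossover and mutation
        probs = fitness / fitness.sum()
        parents = population[rng.choice(n_individuals, size=n_individuals, p=probs)]
        population = mutate(crossover(parents))
    # Step 3B: return the fittest individual's genes
    return population[np.argmax(fitness)]
```

A toy run would pass a cheap surrogate fitness (e.g., distance of the genes from a known optimum) in place of the scattering-based fitness described in Step 2 below.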

Step 1: Generating individuals
For each gene-based CREASE run, the user must specify how many individuals, N, are in a generation. The larger the N, the greater the population diversity but the higher the computational resources required to evaluate the fitness of each individual in the population; the opposite is true for a smaller N. We discuss later how using machine learning (ML) can significantly reduce the computational resources needed to calculate fitness per individual, which in turn enables a larger N with minimal additional computational intensity. For the Debye equation evaluated gene-based CREASE, we set N = 35, and for the ANN evaluated gene-based CREASE, we set N = 105.
Each individual's structure is described using 10 genes:
• Genes 1 to 5 are related to the nanoparticle(s) size(s) and composition/concentration. Gene 1 and gene 2 are nanoparticle type A's average diameter and dispersity, respectively; gene 3 and gene 4 are nanoparticle type B's average diameter and dispersity, respectively. Gene 5 is the composition/concentration of nanoparticle type A. As mentioned previously, users do not need to provide the exact diameter, dispersity, and composition/concentration; they provide a range to allow CREASE to converge to the average diameter, dispersity, and composition/concentration. For one-component systems, genes 3 and 4 relate to coarse-grained solvent particles that are present to allow reconstruction of A type nanoparticle structures from disperse to aggregated, and we recommend the user utilize similar values as those used for genes 1-2, as genes 3-4 may not be needed as output for one-component systems.
• Genes 6 to 8 are related to how CREASE produces structures with different degrees of mixing/aggregation. Gene 6 is related to the average nanoparticle domain size; gene 7 is related to the compactness of the domain(s); gene 8 is related to the spacing between domains.
• Gene 9 allows CREASE to apply a background scattering intensity to avoid fitting at I(q) values that are below the background.
• Gene 10 is only needed for 3D structural reconstruction to do the Debye equation based Icomp(q) calculation. This gene sets the number of nanoparticles (in essence, the system size) for the structural reconstruction; larger values (larger system sizes) require more computational resources. The low q regime of the I(q) is sensitive to the system size (with noise in the I(q) at q values close to and below that corresponding to the system size), so the user should check that the system size is large enough to produce negligible/minimal noise over the q range of interest. For this work, we found that setting gene 10 to 20,000 nanoparticles provides sufficient resolution for the scattering intensity and radial distribution function while considering that the computational resource requirement scales as the number of nanoparticles squared. This is one of the major advantages of the ML augmented gene-based CREASE method: it can completely skip this intensive 3D structure reconstruction and Debye-based computed scattering calculation for every individual!

Initially, all N individuals receive random gene values within the upper and lower limits set by the user. Once we have the genes for every individual, in the Debye calculation gene-based CREASE, we then convert the genes into a 3D structure (which is ultimately used to obtain the computed scattering profile for each individual) using the method described in detail in Section VI of this document. The ANN evaluated gene-based CREASE does not need this reconstruction step until the end, to produce the output.
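Since the user supplies only upper and lower limits, each gene can be stored as a normalized value in [0, 1] and mapped to physical units on demand. The sketch below shows one way to do that; the gene names and bound values are illustrative examples, not the package's defaults.

```python
import numpy as np

# Illustrative per-gene bounds (name, lower, upper) for a binary nanoparticle
# system; these numbers are hypothetical user inputs, not package defaults.
GENE_BOUNDS = {
    1: ("diameter_A (nm)", 100.0, 300.0),
    2: ("dispersity_A", 0.0, 0.3),
    3: ("diameter_B (nm)", 100.0, 300.0),
    4: ("dispersity_B", 0.0, 0.3),
    5: ("composition_A (vol frac)", 0.1, 0.9),
    6: ("domain size", 0.0, 1.0),
    7: ("domain compactness", 0.0, 1.0),
    8: ("domain spacing", 0.0, 1.0),
    9: ("background intensity", 0.0, 1e-3),
    10: ("n_particles", 5000.0, 20000.0),
}

def decode_genes(genes01):
    """Map normalized genes in [0, 1] to physical values via the user-set bounds."""
    return {name: lo + g * (hi - lo)
            for g, (name, lo, hi) in zip(genes01, GENE_BOUNDS.values())}
```

A gene vector of all 0.5 values, for example, decodes to the midpoint of every user-supplied range.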

Step 2: Evaluating fitness for every individual
After converting the gene information into a corresponding structure (if using the Debye equation evaluated gene-based CREASE method), we then evaluate the fitness of each individual to quantify how closely the Icomp(q) from each individual matches the Itarget(q). We consider both the direct scattering calculation method using the Debye equation and the more computationally efficient method using ML.
For the Debye equation evaluated gene-based CREASE method, we explicitly calculate the scattering for the 3D structure using the Debye scattering equation:7

Icomp,X(q) = (1/Vsample) Σ_{i=1}^{N_X} Σ_{j=1}^{N_X} f_i(q) f_j(q) sin(q r_ij)/(q r_ij)   (S1)

Icomp,X(q) is the computed scattering intensity for nanoparticle type X (either nanoparticle A or B) considering all N_X nanoparticles in the structure. We define r_ij as the pairwise distance between nanoparticles i and j, and Vsample is a scaling term that sets Icomp,X(q) to 1.0 at the lowest q value considered to facilitate comparison between Itarget(q) and Icomp(q). f_i(q) is the spherical form factor amplitude for nanoparticle i, defined as:8

f_i(q) = Δρ_i V_i [3(sin(q R_i) − q R_i cos(q R_i))/(q R_i)^3]   (S2)

V_i is the nanoparticle volume, R_i the nanoparticle radius, and Δρ_i the scattering length density (SLD) or electron density difference between the solvent and the nanoparticle. Because we consider generic A and B nanoparticle chemistries, we set Δρ_i to 1.0, though the value can be adjusted based on the degree of contrast match for a two-component nanoparticle assembly.5 When analyzing the SAXS profile as input, we must further incorporate the effects of X-ray slit smearing into our Icomp,X(q) using the instrument's slit length (Δl) of 0.23969 nm^-1 and slit width of ~0 nm^-1:

Ismeared(q) = (1/Δl) ∫_0^Δl Icomp[(q^2 + l^2)^{1/2}] dl   (S3)

When analyzing the SANS profile as input, we must further incorporate the effects of neutron pinhole smearing into our Icomp,X(q) using the instrument's output variance σ and the mean scattering vector.10 The q range used for the in silico targets is designed to span multiple decades to allow for optimization of both the nanoparticle arrangement (low q) and nanoparticle size and dispersity (high q). For the SAXS and SANS inputs, we use the experimental q range to facilitate direct comparisons to the experimental scattering profile. We note that to properly account for experimental smearing in computational systems, one must calculate the I(q) at q values below (above) the lowest (highest) qexp.
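The Debye sum and spherical form factor amplitude described above can be written compactly in NumPy. This is a minimal O(N²) sketch for a small test system; it omits the smearing corrections and mimics the Vsample scaling simply by normalizing to 1.0 at the lowest q.

```python
import numpy as np

def sphere_form_amplitude(q, R, drho=1.0):
    """Spherical form-factor amplitude: drho * V * 3[sin(qR) - qR cos(qR)]/(qR)^3."""
    qR = q * R
    V = 4.0 / 3.0 * np.pi * R ** 3
    return drho * V * 3.0 * (np.sin(qR) - qR * np.cos(qR)) / qR ** 3

def debye_intensity(q, coords, radii, drho=1.0):
    """Debye sum over all particle pairs.

    q: 1D array of wavevectors (nm^-1); coords: (N, 3) particle centers (nm);
    radii: (N,) particle radii (nm). The O(N^2) pairwise distance matrix is
    why large system sizes (gene 10) are computationally expensive.
    """
    rij = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)  # (N, N)
    I = np.empty_like(q)
    for k, qk in enumerate(q):
        f = sphere_form_amplitude(qk, radii, drho)            # (N,) amplitudes
        qr = qk * rij
        # sin(x)/x with the i == j diagonal (x = 0) evaluating to 1
        sinc = np.where(qr > 0.0, np.sin(qr) / np.where(qr > 0.0, qr, 1.0), 1.0)
        I[k] = np.sum(np.outer(f, f) * sinc)
    return I / I[0]  # normalize to 1.0 at the lowest q, mimicking Vsample
```

For a 20,000-particle reconstruction one would vectorize or bin the pair distances; the loop form here is kept for readability.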
Thus, a user must be aware of the need to calculate scattering at even lower q and account for system size effects as discussed in the manuscript. For the ANN evaluated gene-based CREASE variation, we utilize an ANN to predict Icomp(q) directly from each individual's genes. An ANN consists of an input layer, an output layer, and at least one hidden layer, with each layer potentially containing a different number of fully connected nodes (nodes that connect to all nodes in the preceding and succeeding layers).11 We optimize the individual nodes' weights and biases used by the activation function to relate the nodes' inputs and outputs using backpropagation to minimize the chosen loss function (the error between the training data and the ANN prediction).11 We choose the Rectified Linear Unit (ReLU) as the activation function, the Adam optimization algorithm as the backpropagation optimization method, and the mean squared error between the training data and the ANN prediction as the loss function.
For the system of one component nanoparticle solutions, we select an ANN architecture with the following features: a) an output layer with a single node corresponding to the negative log10 of the Icomp(q) at the input q value (negative log10 is used to account for the logarithmic nature of the scattering profile); b) an input layer consisting of 6 nodes, each corresponding to an input for genes 6, 7, and 8 (related to how CREASE produces structures with different degrees of aggregation), gene 5 (nanoparticle concentration), gene 2 (nanoparticles' size dispersity), and log10 qD/2π, where D is the average nanoparticle diameter (gene 1) and q is the wavevector at which the output Icomp is calculated. The D in qD/2π is in nm for scattering q values in units of nm^-1, making qD/2π a dimensionless parameter. This q value normalization enables the ANN to predict Icomp(q) even for systems with nanoparticle diameters it was not trained on, as the ANN learns to predict for the qD/2π term rather than the specific diameter and q. The log10 value is chosen so as not to overemphasize ANN training on sections of the normalized q range, because the q values are logarithmically spaced (as opposed to linearly spaced).
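Assembling the 6-node input vector for one (genes, q) pair can be sketched as below. The gene key names and the ordering of the features are illustrative assumptions; the dimensionless log10 qD/2π term matches the normalization described above.

```python
import numpy as np

def ann_features(genes, q, D):
    """Build the 6-node ANN input for one (genes, q) pair.

    genes: mapping with entries for genes 2 and 5-8 (key names "g2", "g5",
    etc. are hypothetical); q in nm^-1 and D (gene 1, average diameter) in nm,
    so q*D/(2*pi) is dimensionless and transferable across particle sizes.
    """
    return np.array([
        genes["g6"],                       # domain size gene
        genes["g7"],                       # domain compactness gene
        genes["g8"],                       # domain spacing gene
        genes["g5"],                       # concentration gene
        genes["g2"],                       # size dispersity gene
        np.log10(q * D / (2.0 * np.pi)),   # normalized, log-scaled wavevector
    ])
```

Because the last feature depends only on qD/2π, a model trained on one diameter can be queried at another diameter by shifting the q grid accordingly.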
To determine the optimum ANN architecture, we test a variety of hidden-layer counts and nodes per layer (details are provided in Section VII of this document). The training data to test each different ANN architecture is obtained from the Debye equation evaluated gene-based CREASE method with randomly selected genes, linking the input layer (genes) to the resulting Icomp(q). We utilize 5,250 training points [genes and corresponding I(q)] because that corresponds to the same computational cost as running three replicates of the Debye equation evaluated gene-based CREASE method (3 runs * 35 individuals/generation * 50 generations/run). Our work shows that the ANN does not require an excessive quantity of training data or training time. Though a larger training set would likely improve the ANN performance, we demonstrate this approach to minimize the computational intensity required of a user and to illustrate that it can be computationally less intensive to develop and utilize an ANN than to run the Debye evaluated gene-based CREASE if a user has more than one system to analyze. We split the data such that 70% is considered the training set and 30% the validation set. Using the training data, we test various architectures; we consider all permutations of the number of hidden layers (1, 2, and 3) and the number of nodes per layer (8, 16, 32, 64, 128, and 256) (see Section VII). We train the ANN until the losses plateau at 500 epochs without any indication of overfitting the training set data. We find that the ANN performance is optimized with a single hidden layer of 256 nodes; the training set and validation set losses for the various ANN architectures are shown in Section VII. Once the ANN has been trained, it can be used to calculate the Icomp(q) for each individual. When generating the training data, the system size of the training data will directly impact the lowest q value that will give a reliable I(q).
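The training setup described above (6 inputs, one hidden layer of 256 ReLU nodes, Adam optimizer, squared-error loss, 70/30 split, 500 epochs) maps directly onto scikit-learn's MLPRegressor, shown here as a sketch. The target values are synthetic stand-ins, not Debye-generated data, so only the wiring is demonstrated.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

# Synthetic stand-in for the Debye-generated training set: 5,250 points with
# 6 gene-derived inputs and a smooth surrogate for -log10 Icomp(q).
rng = np.random.default_rng(1)
X = rng.random((5250, 6))
y = -np.log10(0.1 + X.sum(axis=1))

# 70/30 train/validation split, single hidden layer of 256 ReLU nodes,
# Adam optimizer, squared-error loss, up to 500 epochs.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.30, random_state=0)
ann = MLPRegressor(hidden_layer_sizes=(256,), activation="relu",
                   solver="adam", max_iter=500, random_state=0)
ann.fit(X_tr, y_tr)
val_score = ann.score(X_val, y_val)  # R^2 on the held-out 30%
```

In practice one would grid over hidden_layer_sizes in {(n,), (n, n), (n, n, n)} for n in {8, 16, 32, 64, 128, 256} and compare the training and validation losses, as summarized in Section VII.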
As such, the system size of the training data will likewise affect the lowest q value that the ANN should be trained on.
Regardless of whether one uses the Debye equation or the ANN model for calculating the Icomp(q) of each individual, the next step is to evaluate each individual's fitness. We define the fitness based on the sum of squared errors (SSE), which we set as the χ² criterion commonly used in scattering analysis.9 We set σ as Itarget(q) to equally weight the error contribution from each q point for all in silico Itarget(q), and we set σ as the scattering error output from the scattering instrument when working with scattering results from experiments, Iexp(q). The two component binary systems' total SSE is the sum of the SSE from each contrast matched scattering intensity. While we use the commonly used χ² error criterion, other error functions can be incorporated instead based on user preference or the specific system.
Using the SSE, we can define each individual's fitness value:

fitness = (SSEmax − SSE) + (SSEmax − SSEmin)/(cs − 1)   (S7)

SSEmax and SSEmin are the maximum and minimum SSE values for the current generation, and cs is a constant set to 10.
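The SSE and fitness calculations above can be sketched as follows. Note that the additive offset term, which keeps even the worst individual's fitness positive so that fitness-proportional selection remains well defined, is our reading of the garbled equation in the source and should be checked against the crease_ga package.

```python
import numpy as np

def sse(I_comp, I_target, sigma):
    """Chi-squared style sum of squared errors.

    sigma = I_target for in silico targets (equal weighting per q point), or
    the instrument-reported error when fitting experimental I_exp(q).
    """
    return float(np.sum(((I_comp - I_target) / sigma) ** 2))

def fitness_values(sse_all, cs=10.0):
    """Per-individual fitness: larger for smaller SSE.

    The offset (SSEmax - SSEmin)/(cs - 1) is a reconstruction of the source's
    equation; it guarantees strictly positive fitness for every individual.
    """
    sse_all = np.asarray(sse_all, dtype=float)
    offset = (sse_all.max() - sse_all.min()) / (cs - 1.0)
    return (sse_all.max() - sse_all) + offset
```

For a generation with SSE values [1, 2, 10] and cs = 10, the offset is 1 and the fitness values come out as [10, 9, 1], so the lowest-SSE individual is ten times as likely to be selected as the worst one.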

Step 3: Checking convergence of fitness
The GA is terminated once the population fitness converges or the maximum number of generations is reached. If the fitness does not converge even upon reaching the maximum number of generations, one should rerun the GA with a larger number of generations. Fitness is considered converged in our case if the average and best individual fitness do not significantly improve over the preceding 10 generations. To check this, we plot the fitness (or SSE) on the y axis (in log scale) and the generation number on the x axis and check that the fitness or SSE plateaus beyond a certain generation. Once the fitness has converged, CREASE returns the overall most fit structure with its Icomp(q).
For the ANN evaluated gene-based CREASE approach, the Icomp(q) calculation is an order of magnitude faster than using the Debye equation (ESI Figure S5). Despite the speedup, one should note that the ANN may not be as accurate in predicting Icomp(q) for a given structure (defined using genes as input) as the Debye equation based Icomp(q) calculation. The accuracy of the ANN can typically be improved by increasing the training data size, which requires additional computational resources.
To improve accuracy without committing too many additional computational resources, we perform the ANN evaluated gene-based CREASE method in a two-step approach. First, we perform the ANN evaluated gene-based CREASE on a restricted high-q range (q values above that corresponding to twice the nanoparticle diameter) to converge only to the nanoparticle size distribution, which is then given as an input to a second ANN evaluated gene-based CREASE run over the entire q range. This two-step approach results in a better structure reconstruction, as each ANN evaluated gene-based CREASE run in these two steps has fewer parameters to optimize and converge to, resulting in higher accuracy outputs without requiring a larger training dataset. For systems where the particle size distributions are known a priori and can be given as an input, we only do the second step of the ANN evaluated gene-based CREASE run over the entire q range. Either way, at the end of the final step, the ANN evaluated gene-based CREASE creates a structure from the 'best' (highest fitness) individual's genes and calculates its Icomp(q) using the Debye equation evaluated method; that 'best' structure's 3D generation and Debye scattering calculation accounts for most of the timing requirement of the ANN evaluated gene-based CREASE method.
Regardless of the method - Debye equation evaluated gene-based CREASE or ANN evaluated gene-based CREASE - if the fitness has not converged by the current generation, the CREASE method needs to create a next generation of individuals whose fitness will (ideally) improve on that of the preceding generation. The next generation of individuals is produced based on the fitness of the individuals in the current generation. The probability that an individual from the current generation is selected for the next generation is directly proportional to the individual's fitness, whereby the more fit individuals are more likely to continue to the next generation. The individuals that are selected for the next generation then have the chance to undergo crossover and/or mutation, two genetic operators that adjust the diversity of genes in the population. Mutations to an individual's genes allow exploration of the gene space to prevent premature convergence in the GA. Crossovers randomly combine the genes of two individuals to, hopefully, help the GA converge to more optimal genes. The probability that an individual undergoes a crossover is PC, and PM is the probability that a mutation occurs. We allow PC and PM to vary over the GA run to balance the population's gene diversity, because a homogeneous population is unable to explore the gene space to find the optimal gene choices, and an overly diverse population more closely resembles random search, as the GA does not improve on good solutions from the previous generation. At the start of a GA run, we set PC = 0.6 and PM = 0.001, and PC and PM are updated during the GA run based on the genetic diversity measure, GDM, taken as the ratio of the population's average fitness to its best fitness. If GDM is greater than 0.85, the population is not diverse enough, so we increase PM by a factor of 1.1 and decrease PC by a factor of 1.1. On the other hand, if GDM is less than 0.005, the population is overly diverse, so we decrease PM by a factor of 1.1 and increase PC by a factor of 1.1.
If GDM is between 0.005 and 0.85, we do not adjust PM or PC for that generation. After selecting N-1 individuals for the next generation and performing the crossovers/mutations, we utilize elitism, whereby the most fit individual's genes from the preceding generation are automatically carried over to the next generation. This ensures that the best individual is maintained in the next generation to enrich future individuals with its genes. At this point, the next generation of individuals is set, and the GA cycle repeats until the fitness converges.
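The adaptive update of PC and PM described above can be sketched as a single function. The GDM thresholds and 1.1 adjustment factor come from the text; the clamping bounds are illustrative additions to keep the probabilities in a sensible range, not values from the source.

```python
def update_rates(pc, pm, gdm,
                 gdm_high=0.85, gdm_low=0.005, factor=1.1,
                 pc_bounds=(0.1, 0.9), pm_bounds=(1e-4, 0.1)):
    """Adapt crossover (pc) and mutation (pm) probabilities from the genetic
    diversity measure GDM, per the rules in the text.

    GDM > gdm_high: population too homogeneous -> mutate more, cross over less.
    GDM < gdm_low:  population too diverse     -> mutate less, cross over more.
    Otherwise pc and pm are left unchanged. The clamping bounds are
    illustrative, not part of the published scheme.
    """
    if gdm > gdm_high:
        pm, pc = pm * factor, pc / factor
    elif gdm < gdm_low:
        pm, pc = pm / factor, pc * factor
    pc = min(max(pc, pc_bounds[0]), pc_bounds[1])
    pm = min(max(pm, pm_bounds[0]), pm_bounds[1])
    return pc, pm
```

Starting from the initial PC = 0.6 and PM = 0.001, a too-homogeneous generation (GDM > 0.85) raises PM to 0.0011 and lowers PC to about 0.545 for the next generation.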

I.A.3 In silico structures for obtaining Itarget(q) used for CREASE validation
We validate our gene-based CREASE method by showing that it can produce the right structure when used to analyze Itarget(q) obtained from in silico experiments (i.e., molecular simulations) where the target structure corresponding to the Itarget(q) is known. The in silico experiments contain both binary nanoparticle assemblies and one component nanoparticle solutions to illustrate the versatility of the gene-based CREASE method.
The binary nanoparticle assemblies are produced using molecular dynamics simulations with the details of the protocol described in our earlier work.12 We consider a range of systems with A-type volume compositions of 25%v and 50%v, weak and medium nanoparticle demixing, and nanoparticle size dispersity of 9% and 20%. We select these specific systems to briefly illustrate that this gene-based CREASE method outperforms the previous original CREASE method5 while requiring fewer specific inputs. The one component nanoparticle solution systems are produced by placing nanoparticles to achieve a range of nanoparticle aggregation from disperse to strongly aggregating, with nanoparticle concentrations by volume from 10%v to 50%v and nanoparticle size dispersity from 10% to 20%. The one component nanoparticle solution systems' target RDF and Itarget(q) are averages over 10 structures with similar characteristics (e.g., nanoparticle aggregation, concentration, and size distribution) to incorporate the variability in the structure. For these in silico systems, we compare the structure returned by CREASE against the target structure using the RDF, and we quantify the RDF match quality by determining the percent error between the CREASE output structure's RDF and the target structure's RDF. For all systems, we perform three independent gene-based CREASE runs and compare the average and standard deviation from the three gene-based CREASE runs against the target structure. All visualizations are created using the VMD software.13

I.B. Experimental methods

I.B.1 Nanoparticle Synthesis
The synthetic melanin nanoparticles (SMP) are made through a previously reported approach using a base-catalyzed, auto-oxidative polymerization of a dopamine monomer (Sigma Aldrich).14 The synthesis consists of constant stirring of 5 mL of dopamine hydrochloride solution (4 mg/mL), 50 mL deionized water, 20 mL ethanol, and 1.2 mL ammonia (NH4OH; Sigma Aldrich; 28 to 30 wt.%) under ambient conditions for 18 hours, followed by washing using centrifugation. ESI Figure S19 shows the visualization of the SMP nanoparticles using a transmission electron microscope (JEM-12307, JEOL Ltd.). Silica nanoparticles are made using a modified Stöber process.15

I.B.2 Spherical Nanoparticle Assembly Formation
The binary nanoparticle assemblies are formed using the previously reported reverse-emulsion assembly process.16 The reverse-emulsion assembly process involves adding 30 μl of aqueous solution of silica and melanin nanoparticles (30 mg/mL each; 10.8 μl of silica + 19.2 μl of melanin nanoparticle suspensions for equal volume fraction binary mixture composition) to 1 mL of anhydrous 1-octanol (Sigma-Aldrich) and vortex mixing the mixture at a speed of 1600 rpm for 2 min followed by 1000 rpm for 3 min. The spherically-shaped nanoparticle assemblies are precipitated and dried at 60°C.
To inspect the internal structure of the nanoparticle assembly, the dried nanoparticle assembly powder is placed in an epoxy resin (EMbed 812) within a block mold and cured at 60°C for ~16 hours. The cured block is trimmed using a Leica S6 EM-Trim 2 (Leica Microsystems) to give a sharp trapezoidal tip. 80 nm-thick slices are cut from the trimmed block using a diamond knife (Diatome Ltd.) on a Leica UC7 ultramicrotome. The slices are loaded onto carbon-coated copper grids (FCF300-Cu; Electron Microscopy Sciences) for transmission electron microscopy.

I.B.3 Small angle X-ray Scattering (SAXS) Measurements
We perform SAXS measurements on the dilute SMP suspensions in deionized water using the USAXS instrument on the 9-ID-C beamline at the Advanced Photon Source at Argonne National Laboratory. The X-ray beam energy is 21 keV (corresponding to an X-ray wavelength of 0.5904 Å) with a beam size of 0.8 × 0.8 mm². The USAXS data collection time is 90 seconds. The SAXS measurements are performed at ambient temperature. The X-ray beam is monochromatized and collimated using a Bonse−Hart geometry with two channel-cut Si(220) crystals to provide a high degree of monochromaticity and collimation. Ref. 17 provides additional instrument design and configuration details.
We dilute the SMP nanoparticle solutions to a concentration of 10 mg/mL and place the solution in a glass capillary of ~90 mm length, ~1.5-1.8 mm diameter, and ~0.2 mm wall thickness (9530-3 Pyrex Capillary Tubes, Corning, Inc.). We mount the samples using Scotch Magic Tape (The 3M Company) in the instrument capillary holders. The SAXS measurements are corrected for instrument background and empty sample capillary scattering. The Indra package within the Igor Pro 8.04 (Wavemetrics, Inc.) environment performs the raw SAXS data reduction.

I.B.4 Small angle Neutron Scattering (SANS) Measurements
All the SANS experiments were conducted at the National Institute of Standards and Technology Center for Neutron Research.
The standard configurations of the vSANS instrument are used to run the measurements, i.e., the high-q setup used 6 Å neutrons with front and middle detector carriages set at 1.1 m and 5.1 m, respectively, from the sample while the low-q setup used 11 Å neutrons with front and middle detector carriages set at 4.6 m and 18.6 m, respectively, from the sample.
The non-contrast-matched (NCM; total scattering contribution from melanin and silica nanoparticles) profile is obtained from the nanoparticle assembly suspensions in 100% deuterated 1-octanol. The melanin contrast-matched (MCM; scattering contribution from silica nanoparticles) profile is obtained from the nanoparticle assembly suspensions in 60% deuterated 1-octanol and 40% hydrogenous 1-octanol. The nanoparticle assemblies are held in quartz banjo cells (Product # 120-2mm and 120-1mm, respectively; Hellma USA) to avoid any undesired scattering contribution from the containers. The SANS experiments are performed at ambient temperature, and the measured intensities are corrected for background scattering and empty cell contributions. The experimental scattering intensities are also normalized using a reference scattering intensity of a polymer sample of known cross-section. The reduction of raw SANS data is performed following a well-known protocol described by Kline.18 All the experimental scattering datasets are pinhole smeared. We compare and contrast our new gene-based CREASE method, which relies exclusively on genes to generate the nanoparticle structure (the focus of the manuscript), against our original CREASE method.5

II. Discussion comparing presented new gene-based CREASE to the original CREASE
First, we briefly summarize how the original CREASE version functions for a binary mixture of type A and type B nanoparticles, referring the reader to the original article5 for all specifics. The original CREASE begins with multiple individuals with type A and type B particles placed in a starting three-dimensional structure using randomly generated pair-wise interaction strengths (A-A, A-B, and B-B) to generate a variety of initial structure morphologies. This leads to the first generation of the genetic algorithm with individuals that differ in structures (arrangement of type A and B particles from randomly mixed to strongly segregating) and pair-wise interaction strengths. For each individual in the first generation, its structure's 'computed' scattering profile is calculated and compared to the 'target' scattering profile (e.g., input experimental scattering profile). The quantitative match between computed and target scattering profile is related to a 'fitness' value, with a good match leading to a higher fitness value. The probability that an individual is selected to continue to the next generation is defined as the fitness value of the individual divided by the sum of all individuals' fitness values. Individuals selected for the next generation then undergo swaps of A and B particles' positions within their 3D structure; these particle swaps are accepted or rejected based on changes in energy (using the Metropolis algorithm19); the changes in energy for each swap are calculated using the pair-wise interaction strengths of that individual. After the particle swap is attempted for that individual, its pair-wise interaction energies are also changed. These steps are repeated for all individuals over multiple generations until the original CREASE algorithm converges to an individual whose structure has the highest fitness from all generations.
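The Metropolis acceptance rule used for the particle swaps above can be sketched in a few lines. The kT energy scale here is a nominal illustrative choice; in the original CREASE the energy change comes from the individual's pair-wise interaction strengths.

```python
import numpy as np

rng = np.random.default_rng(2)

def metropolis_swap(energy_change, kT=1.0):
    """Accept or reject an A-B particle position swap via the Metropolis criterion.

    Swaps that lower the energy are always accepted; swaps that raise it are
    accepted with probability exp(-dE/kT). kT = 1.0 is an illustrative scale.
    """
    if energy_change <= 0.0:
        return True
    return bool(rng.random() < np.exp(-energy_change / kT))
```

Energy-lowering swaps always go through, while strongly unfavorable swaps are accepted only rarely, which is what lets the structure iteratively relax toward the individual's implied morphology.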
Thus, the original CREASE algorithm, similar to other Reverse Monte Carlo (RMC) algorithms, iteratively improves the structure over time (generations). However, the original CREASE improves on traditional RMC algorithms by reducing the number of generations/RMC moves by considering multiple structures at once (multiple individuals per generation) and having the initial structures start with a wide range of type A and B particle mixing to increase the probability that at least one starting structure is similar to the target structure.
We now discuss the significant improvements that the new gene-based CREASE offers compared to the original CREASE, with many comparisons applicable to other RMC algorithms. As the name implies, the gene-based CREASE method utilizes genes to generate the nanoparticle structure, with the same genes able to reproducibly produce similar nanoparticle structures. By solely using genes to produce nanoparticle structures, the gene-based CREASE enables rapid exploration of the design space (particle mixing, nanoparticle diameter and dispersity, etc.), as the gene-based CREASE does not need to iteratively progress through 3D structures (with stored spatial coordinates of all A and B particles) as in the original CREASE. The use of genes also enables the individuals to have different nanoparticle size distributions, as the gene-based CREASE decouples the nanoparticles' sizes from the structural arrangement, allowing both to be tuned.
Further, the gene-based CREASE framework enables the use of machine learning (ML), specifically an artificial neural network (ANN), to rapidly convert the genes corresponding to the structure to a scattering profile, bypassing the time-consuming 3D structure generation and Debye calculation. The choice of an ANN as the ML model is not unique, and other ML models, such as a random forest model, can be utilized to perform a similar task. The use of ML significantly reduces the analysis time to identify the genes of interest and, ultimately, the structure with a close scattering match to the Itarget(q). The reduction in computational time for the gene-based CREASE makes the method accessible to experimental users without access to high-performance clusters. Finally, for training the ANNs, the normalization of the q values by the nanoparticle diameter expands the applicability of the ML model to nanoparticle sizes and q ranges beyond those on which it was initially trained (as long as the normalized q values are within the range that the ANNs were trained for).
Lastly, the new ML-augmented gene-based CREASE outperforms the original CREASE in terms of user ease and accessibility, as the new method does not require the exact nanoparticle size distribution (only an estimated range) and does not require the composition/concentration of the nanoparticle solution. We have also made this new method available as an open-source package on GitHub (https://github.com/arthijayaraman-lab/crease_ga).

Figure S1: Quantification of the original and gene-based CREASE methods' matches to scattering profiles from an in silico nanoparticle mixture assembled in spherical confinement (using the simulation method of Ref. 12) with weak and medium nanoparticle demixing, a particle diameter of 220 nm with 9% particle size dispersity, and 25%v A-type composition. In all plots, the original CREASE method result is in red, the gene-based CREASE method with the particle size distribution input is in blue, and the gene-based CREASE method without the particle size distribution input is in purple. In a) we provide the average χ² error, as the standard deviation is negligible at the precision provided; in b) and c), for the CREASE methods, we provide the average and standard deviation from 3 independent CREASE runs; the primary particle peak is marked by the dashed gray line, and a gray arrow directs the reader to the primary peak error.

Discussion of results in Figure S1. We quantify the closeness of the scattering match between Icomp(q) and Itarget(q) using the χ² value commonly used in scattering analysis:9

$$\chi^2 = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{I_{comp}(q_i) - I_{target}(q_i)}{\sigma(q_i)}\right)^2 \qquad (\mathrm{S11})$$

We set σ(q) as Itarget(q) to equally weight the error contribution from each q point for all in silico Itarget(q), and we set σ(q) as the scattering error output by the scattering instrument when working with experimental scattering results Iexp(q). The two-component binary systems' total χ² is the combination of the χ² from each contrast-matched scattering intensity. A perfect scattering match between Icomp(q) and Itarget(q) has a χ² of 0.0, and a poor scattering match has a large χ². Thus, a low χ² is desirable, indicating a close scattering profile match. For this work, we have found that a strong scattering match tends to occur for χ² < 1.
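The χ² described above can be computed directly from two intensity arrays on a shared q grid. Below is a minimal sketch; the function name is ours, the 1/N normalization follows the reduced-χ² convention, and σ defaults to the in silico convention of σ(q) = Itarget(q):

```python
import numpy as np

def chi_squared(i_comp, i_target, sigma=None):
    """Chi-squared goodness of fit (mean over q points) between a computed
    and a target scattering profile on the same q grid. If sigma is None,
    each point is weighted by the target intensity (the in silico
    convention in the text); for experimental data, pass the
    instrument-reported uncertainties instead."""
    i_comp = np.asarray(i_comp, dtype=float)
    i_target = np.asarray(i_target, dtype=float)
    sigma = i_target if sigma is None else np.asarray(sigma, dtype=float)
    return float(np.mean(((i_comp - i_target) / sigma) ** 2))

# a perfect match gives chi^2 = 0.0
assert chi_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]) == 0.0
```

For a binary two-component system, the total χ² would be accumulated over both contrast-matched intensities in the same way.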
To quantify the local structure, we perform the radial distribution function (RDF) calculation, and while an RDF comparison between the target structure and the original/new CREASE structure is itself a quantitative metric, we also calculate the RDF percent error to further quantify the closeness of the RDF match. We define the RDF percent error (also called Δ%RDF) as:

$$\Delta\%\,\mathrm{RDF}(r) = \frac{\mathrm{RDF}_{CREASE}(r) - \mathrm{RDF}_{target}(r)}{\mathrm{RDF}_{target}(r)} \times 100$$

The Δ%RDF includes a division by the target RDF value, so when the target RDF value is small (typically at small r distances, before the first primary RDF peak), minor differences between the target and CREASE RDF values can be excessively magnified. For example, if the target RDF is 0.01 and the CREASE output structure RDF is 0.02, then the percent error is 100% [(0.02 − 0.01)/0.01 × 100] while the RDF difference (0.01) is small. This issue is especially relevant for ESI Figures S6-S11. The lower the Δ%RDF, the better the RDF match.

Figure S2: Same as Figure S1; this figure is for medium nanoparticle demixing, 50%v A-type composition, and particle diameter of 220 nm with 9% or 20% particle size dispersity.

Figure S3: Similar to Figure 3 in the main manuscript; this figure is for 50%v A-type nanoparticles, weak demixing of A and B nanoparticles, and average nanoparticle diameter of 220 nm with 9% or 20% particle size dispersity (log-normal distribution) for both A and B nanoparticles.
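The RDF percent error (Δ%RDF) defined earlier is a simple pointwise calculation; a minimal sketch follows (the function name is ours), reproducing the worked example of a 0.01 target versus a 0.02 CREASE value:

```python
import numpy as np

def rdf_percent_error(rdf_crease, rdf_target):
    """Delta %RDF = (RDF_CREASE - RDF_target) / RDF_target * 100 at each
    r bin. Note that small target values (r below the first primary peak)
    inflate this metric, as discussed in the text."""
    rdf_crease = np.asarray(rdf_crease, dtype=float)
    rdf_target = np.asarray(rdf_target, dtype=float)
    return (rdf_crease - rdf_target) / rdf_target * 100.0

# worked example from the text: target 0.01, CREASE 0.02 -> 100% error
# even though the absolute RDF difference is only 0.01
err = rdf_percent_error([0.02], [0.01])
```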
[Figure panels: RDF AA, RDF AB, and RDF BB at 9% and 20% particle size dispersity.]

Figure S4: Same as Figures S1 and S2; this figure is for weak nanoparticle demixing, 50%v A-type composition, and particle diameter of 220 nm with 9% or 20% particle size dispersity.

Figure S5. Comparison of the time, in minutes, required by the artificial neural network (ANN) evaluated gene-based CREASE and the Debye equation evaluated gene-based CREASE. The purple data point is the time required to perform one run (replicate) of the Debye equation evaluated gene-based CREASE (a run has 35 individuals/generation and is run for 50 generations, at which point the optimum structure is achieved). The blue data point is the time required to obtain (or collect) the data used to train the ANN. The red data point represents the time required to train the ANN using that data. The times represented by the red and blue data points are a one-time commitment and do not need to be repeated as long as the system of interest (e.g., normalized q, dispersity of particle sizes) is within the range of the training data. The black data point is the time required to perform a single run (replicate) of the ANN evaluated gene-based CREASE (a run contains 105 individuals/generation and is run for 50 generations, by which point the optimum structure is achieved). As discussed in the Methods, 105 individuals per generation are used in the ANN evaluated gene-based CREASE because the ANN enables evaluating more individuals per generation (reducing the likelihood of premature structure convergence) at lower computational cost compared to the Debye equation evaluated gene-based CREASE.
The comparison of the black and purple symbols in Figure S5 shows the significant speed-up achieved with the ANN evaluated gene-based CREASE over the Debye equation evaluated gene-based CREASE. The timing benchmarks provided here are based on runs conducted on a dual-socket Intel "Broadwell" 18-core processor.

IV. Gene-based CREASE validation on one-component nanoparticle solution with various nanoparticle aggregation and concentration
For ESI Figures S6-S11, we provide both the nanoparticle RDF and the percent error of the CREASE RDF relative to the target RDF (Δ%RDF). As previously discussed, the percent error at very small distances (below the first primary RDF peak) can be excessive because the target RDF value is very small, so even minor differences are inflated. Thus, ESI Figures S6-S11 provide a truncated RDF percent error plot directly below each RDF plot with the y axis set to at most [-50, 50] to demonstrate the close match over the majority of the RDF calculation. ESI Figures S6-S11 also include the full RDF percent error plot below the truncated plot to provide the curious reader with all information.

Figure S7: Same as Figure S6; this figure is for 30%v concentration and nanoparticle diameter of 220 nm with 10% particle size dispersity.

Figure S8: Same as Figure S6; this figure is for 50%v concentration and nanoparticle diameter of 220 nm with 10% particle size dispersity.

Figure S9: Same as Figure S6; this figure is for 10%v concentration and nanoparticle diameter of 220 nm with 20% particle size dispersity.

Figure S10: Same as Figure S6; this figure is for 50%v concentration and nanoparticle diameter of 220 nm with 20% particle size dispersity.

Figure S13: Same as Figure S6; this figure is for 50%v concentration and nanoparticle diameter of 220 nm with 20% particle size dispersity.

In the Debye equation evaluated gene-based CREASE, to calculate the computed scattering profile for each individual, we need to convert that individual's genes to a 3D structure. In the ANN evaluated gene-based CREASE, we do not need to perform this construction of the 3D structure until the end of the GA, when we provide the output for the 'best' individual. We note that this gene-based representation is not lossless, as a set of similar structures could be represented by the same or a very similar set of genes.
Because of the inherent randomness designed into the gene set to structure conversion, the same set of genes converted to a structure twice will not give the exact same structure. Instead, the structures will be similar with similar RDFs and scattering. This enables the generation of an ensemble of similar structures rather than solely generating the exact same structure each time.
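To illustrate why bypassing the per-individual Debye evaluation matters, the Debye pair sum can be sketched as below. This is a minimal sketch for identical point scatterers: form factors, size dispersity, normalization, and the efficiency optimizations used in CREASE are all omitted, and the function name is ours. The cost is O(N²) in the number of particles per q point, which is what the ANN surrogate avoids:

```python
import numpy as np

def debye_scattering(coords, q_values):
    """Debye equation for N identical point scatterers:
    I(q) = sum_i sum_j sin(q r_ij) / (q r_ij), with the i = j terms
    contributing 1 each (the sin(x)/x -> 1 limit). O(N^2) per q point."""
    coords = np.asarray(coords, dtype=float)
    # all pairwise center-to-center distances, shape (N, N)
    diff = coords[:, None, :] - coords[None, :, :]
    r = np.sqrt((diff ** 2).sum(axis=-1))
    intensity = []
    for q in q_values:
        qr = q * r
        sinc = np.ones_like(qr)          # diagonal (qr = 0) terms -> 1
        mask = qr > 0
        sinc[mask] = np.sin(qr[mask]) / qr[mask]
        intensity.append(sinc.sum())
    return np.array(intensity)
```

For a single scatterer this returns 1 at every q; for two scatterers separated by distance d it returns 2 + 2 sin(qd)/(qd), the textbook two-particle interference term.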

To start the 3D structure construction, we place the number of nanoparticles specified by gene 10 (for our case, ~20,000) into a cubic lattice grid pattern with the separation between particles set to the largest nanoparticle diameter, based on the average diameter and the dispersity (genes 1-4). All particles are initially considered type A nanoparticles (two-component system) or solvent particles (one-component system). Using gene 5 (composition/concentration), we identify Nswap, the number of particles that must be changed to type B (two-component system) or NPs (one-component nanoparticle system). If Nswap is greater than one, we randomly select one of the type A/solvent particles and change its identity to type B/NPs; this swapped nanoparticle begins the current nanoparticle domain. We then undergo an iterative cycle to select the remaining Nswap − 1 particles to swap identities. We generate a random number between [0, 1], and if the random number is less than or equal to gene 6, the current domain grows. A large value of gene 6 ensures, on average, large particle domains, and a small gene 6 value corresponds, on average, to small particle domains. If the current domain grows, we randomly select a type A/solvent neighbor from all possible type A/solvent neighbors of the current domain. The set of neighbors comprises all type A/solvent particles directly adjacent (in the cubic lattice, the cube faces are touching) to the type B/NPs particles in the current domain. For the randomly selected neighbor, we calculate its neighbor ratio of type A/solvent neighbors to type B/NPs neighbors. We generate another random number between [0, 1], and if the random number is less than or equal to one minus the absolute value of the quantity of gene 7 minus the neighbor ratio, that neighbor is changed to type B/NPs. If the randomly selected neighbor fails the check, the process is repeated by randomly selecting another neighbor from the set of all neighbors of the current domain. A large value of gene 7 corresponds to a more compact domain, and a small gene 7 value results in an elongated domain. Finally, if the original check against gene 6 fails, the current domain does not grow, and a new domain is started. To start the new domain, a random type A/solvent particle is chosen as a potential starting point. Similar to what we did with gene 7, we determine the neighbor ratio of the potential new-domain starting point (type A/solvent neighbors to type B/NPs neighbors). We generate another random number between [0, 1], and if the random number is less than or equal to one minus the absolute value of the quantity of gene 8 minus the neighbor ratio, then that particle is changed to type B/NPs and becomes the starting point of the new domain. A small gene 8 selects for a starting point not neighboring type B/NPs, and a large gene 8 attempts to start a new domain in a spot with many neighboring type B/NPs. At any point during this process, if one of the neighbors of the current domain is a type B/NPs particle in a different domain, the previous domain is considered part of the current domain, as the domains have merged.
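The gene 6-8 acceptance rules described above amount to simple probability checks, sketched below. This is a hypothetical minimal implementation: the function names are ours, and the clamping of the acceptance probability to [0, 1] is our assumption, since the text only states the threshold 1 − |gene − neighbor ratio|:

```python
import random

def domain_grows(gene6, rng):
    """Gene 6 check: True if the current domain should grow this step."""
    return rng.random() <= gene6

def neighbor_ratio(n_type_a, n_type_b):
    """Ratio of type A/solvent neighbors to type B/NPs neighbors of a
    candidate lattice site (candidates touch the domain, so n_type_b >= 1)."""
    return n_type_a / n_type_b

def accept_swap(gene, ratio, rng):
    """Gene 7 (domain growth) / gene 8 (new-domain seed) rule: accept the
    candidate with probability 1 - |gene - ratio|, clamped to [0, 1]."""
    p = max(0.0, min(1.0, 1.0 - abs(gene - ratio)))
    return rng.random() <= p

rng = random.Random(42)
# a candidate whose neighbor ratio exactly matches gene 7 is always accepted
assert accept_swap(0.5, 0.5, rng)
```

In the full construction these checks sit inside the iterative loop that selects the Nswap particles, with the domain-merging bookkeeping described in the text layered on top.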
After selecting the Nswap particles, we randomly assign the nanoparticle sizes for each chemistry (two-component) or for the nanoparticles and coarse-grained solvent (one-component) to mimic the particle size distributions from genes 1-4. The polydispersity in each nanoparticle chemistry's size is included by discretizing the log-normal distribution into 11 groups of particles, with each group's diameter selected from the log-normal distribution. We randomly assign the nanoparticles in each of the groups so as not to artificially introduce size segregation in which smaller/larger particles are localized together; however, a user can easily adjust this if their specific system exhibits size segregation. Next, we expand the cubic lattice to an initial occupied volume fraction, η, of 0.25 in a simulation box before bringing the particles into a close-packed structure with a final η of 0.55-0.6 (depending on particle size dispersity). We follow a simulation methodology similar to that used in our original CREASE work to generate a close-packed structure.5 In summary, we first utilize a conjugate gradient (CG) energy minimization technique20 implemented in the LAMMPS software package21 by applying a restoring potential to all particles based on their spatial location, such that particles furthest from the center of the box incur the highest added energy. All particles interact through the colloid Lennard-Jones (cLJ) potential with a weak Hamaker constant set to 0.1 kBT to disallow particle overlap. After the conjugate gradient energy minimization, the structures undergo a short molecular dynamics (MD) simulation in the NVT ensemble at T* = 1.0 to further improve particle packing, particularly at high particle dispersity. The MD simulation is run with a timestep size of 0.004 τ for approximately 50,000 timesteps.
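The 11-group discretization of the log-normal size distribution can be sketched as follows. This is a minimal, hypothetical sketch: we assume equal-probability bins sampled at their midpoint quantiles, a detail the text does not specify, and the function name is ours; only standard-library Python is used:

```python
import math
from statistics import NormalDist

def discretize_lognormal(mean_d, dispersity, n_groups=11):
    """Discretize a log-normal diameter distribution with the given
    arithmetic mean diameter and relative dispersity (std/mean) into
    n_groups diameters, one per equal-probability bin (midpoint quantile).

    Log-normal parameters from mean m and std s = dispersity * m:
    sigma^2 = ln(1 + (s/m)^2), mu = ln(m) - sigma^2 / 2."""
    sigma2 = math.log(1.0 + dispersity ** 2)
    normal = NormalDist(math.log(mean_d) - 0.5 * sigma2, math.sqrt(sigma2))
    quantiles = [(i + 0.5) / n_groups for i in range(n_groups)]
    # exponentiating normal quantiles gives log-normal quantiles
    return [math.exp(normal.inv_cdf(q)) for q in quantiles]

# e.g., 220 nm mean diameter with 9% dispersity -> 11 group diameters
diameters = discretize_lognormal(220.0, 0.09)
```

Particles would then be assigned to these 11 diameter groups at random lattice sites, matching the text's note that group membership is randomized to avoid artificial size segregation.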
We would like to caution the user that the CG and MD settings we use are suitable for the system size and nanoparticle sizes considered in this work; however, they are likely not universally applicable to any arbitrary system of interest. We encourage the user to check their final structures to ensure they have achieved a sufficient packing fraction for their application. To improve the packing fraction for a specific system, the user could consider reducing the strength of the restoring potential, reducing the timestep size, and/or increasing the number of MD simulation timesteps.