Fast Multichannel Inverse Design through Augmented Partial Factorization

Computer-automated design and discovery have led to high-performance nanophotonic devices with diverse functionalities. However, massively multichannel systems such as metasurfaces controlling many incident angles and photonic-circuit components coupling many waveguide modes still present a challenge. Conventional methods require Min forward simulations and Min adjoint simulations—2Min simulations in total—to compute the objective function and its gradient for a design involving the response to Min input channels. Here, we develop a formalism that uses the recently proposed augmented partial factorization method to obtain both the objective function and its gradient for a massively multichannel system in a single or a few simulations, achieving over 2 orders of magnitude speedup and reduced memory usage. We use this method to inverse design a metasurface beam splitter that separates the incident light to the target diffraction orders for all incident angles of interest, a key component of the dot projector for 3D sensing. This formalism enables efficient inverse design for a wide range of multichannel optical systems.


I. INTRODUCTION
Nanoengineered photonic devices can realize versatile and high-performance functionalities in a compact footprint, expanding the limited scope of conventional optical components.Computer-automated inverse design [1][2][3][4][5][6][7] can search a high-dimensional parameter space to discover optimal structures that outperform manual designs or realize new functionalities.With inverse design, the photonic structure is updated iteratively to optimize an objective function f that encapsulates the desired properties.Given the many parameters p = {p k } used to parametrize the design, efficient optimization typically requires the gradient ∇ p f of the objective function with respect to all parameters.The computational efficiency is a critical consideration since full-wave simulations are necessary to model the complex light-matter interactions at the subwavelength scale, and numerous simulations are needed for the many iterations of the search process.
We recently proposed the "augmented partial factor- * cwhsu@usc.eduization" (APF) method, which can solve multi-input electromagnetic forward problems in one shot, offering substantial speed-up and memory usage reduction compared to existing methods [33].But the inverse problem was unsolved since the formalism of Ref. [33] does not yield the gradient.Here, we generalize APF to enable efficient gradient computation for multi-input problems.We are able to obtain both f and ∇ p f in a single or a few computations without a loop over the individual input channels.
For a 1-mm-wide metasurface with 2400 inputs, APF achieves ∼150 times speed-up and ∼30% memory usage reduction compared to the conventional adjoint method.
As an example problem, we inverse design a metasurface beam splitter that splits the incident light equally into the ±1 diffraction orders over an incident angular range of 60 • .

II. MULTI-INPUT GRADIENT COMPUTATION USING APF
The novelty of the APF method lies in encapsulating the linear response of the multi-channel system in a generalized scattering matrix S = CA −1 B − D and then computing the entire S in a single shot through the partial factorization of an augmented matrix K = [A, B; C, D] that yields its Schur complement −CA −1 B [33].Here, the S nm element of the M out × M in matrix S is the field amplitude in output channel n given an input in channel m at frequency ω.Matrix A is the discretized Maxwell differential operator −(ω/c) 2 ε r (r)+∇×∇× of the structure defined by its relative permittivity profile ε r (r), the M in columns of matrix B contain the M in distinct input source profiles, the M out rows of matrix C contain the M out output projection profiles, and matrix D subtracts the baseline contribution from the incident field; they are schematically shown in Fig. 1a.If the number of nonzero elements in matrices B, C, and S are less than the number of nonzero elements in matrix A, the single-shot computing time and memory usage of APF will be independent of how many input and output channels there are [33].In the following, we use APF with finite-difference discretization implemented in our open-source software MESTI [34], using the MUMPS package for factorization [35].
Here, we derive a general formulation such that the gradient of any multi-channel objective function can also be computed in a single shot regardless of the number M in of input channels.We use vector p = [p 1 , • • • , p Np ] to denote the N p real-valued variables parameterizing the photonic design; in the example later, {p k } will be the edge positions of the ridges of a metasurface.The objective function f (also called the figure of merit, FoM) that evaluates the performance of the multi-channel device is a function of the generalized scattering matrix S and the parameters p, namely f [S(p), p].The gradient df /dp k we want follows from the chain rule as Note that S nm is complex-valued, and ∂f /∂S nm is a Wirtinger derivative.Both ∂f /∂p k and ∂f /∂S nm can be calculated analytically given the definition of the objective function f for a specific problem.So, the simulations only need to evaluate the derivative of the scattering matrix, ∂S nm /∂p k .
The parameters {p k } modify the scattering matrix S by modifying the photonic structure ε r (r) in the Maxwell operator A.
Taking the derivative of S = CA −1 B − D and using the identity Here we assume that matrices B, C, and D do not depend on {p k }; additional terms can be added if there is such dependence.This generalizes the adjoint method to multi-channel systems: the columns of A −1 B correspond to M in forward problems, and the rows of CA −1 correspond to M out adjoint problems.To recover the conventional adjoint method, one can substitute ∂S/∂p k into Eq.( 1) and sum over the output channels for each input, which converts the M out adjoint problems to M in adjoint problems.
A key observation is that the derivative ∂A/∂p k above is a low-rank matrix since only a few elements of A depend on the parameter p k .For example, with 2D transverse magnetic (TM) waves at angular frequency ω, matrix A is the discretized version of −∇ 2 − (ω/c) 2 ε r (r), and ∂A/∂p k is zero everywhere except for the few diagonal elements corresponding to pixels where the relative permittivity profile ε r (r) is modified by p k .Matrix ∂A/∂p k is also symmetric due to reciprocity.Therefore, we can do a symmetric singular value decomposition where Σ k is an r k -by-r k diagonal matrix containing the singular values, r k is the rank of ∂A/∂p k , the r k columns of U k are the left-singular vectors (which are real-valued and equal the right-singular vectors), and T stands for matrix transpose.These singular vectors are zero everywhere except at the pixels where ε r (r) is modified by p k .
We then obtain the derivative of the scattering matrix with respect to the k-th parameter as shown in Fig. 1b.To obtain the complete gradient with respect to all N p parameters {p k }, we combine the singular-vector matrices as with APF would yield the complete gradient through Eqs.(1-3) with just two APF computations.As shown in Fig. 1c, we can further reduce from two APF computations to one by building new matrices B = [B, U] and C = [C; U T ] and using APF to compute Here, the matrix K = A, B; C, 0 is augmented with not only the original M in input and M out output channel profiles B and C but also M p additional inputs/outputs being the singular vectors U and U T from the design parameters {p k }.This way, a single-shot APF computation solves all of the M in forward simulations (yielding the scattering matrix S from CA −1 B) and also obtains the complete gradient (i.e., ∂S/∂p k and df /dp k for all p k , from CA −1 U and U T A −1 B) at the same time.
While a single APF computation suffices, doing so is not necessarily the most computationally efficient.When the number of elements in matrix S, numel( S) = (M in + M p )(M out + M p ), is less than the number of nonzero elements in the Maxwell operator matrix A, the APF computing time and memory usage are independent of how many inputs and outputs there are (including the singular vectors U), and we can include as many inputs/outputs as we want in a single APF computation.But when numel( S) > nnz(A), the APF computing time and memory usage will grow linearly with numel( S) [33]; this may be the case here since topology optimization often includes a large number of parameters.We can mitigate this increase by avoiding the computation of the M p -by-M p matrix U T A −1 U in Eq. ( 4) [gray-shaded area in Fig. 1c], which is not needed for either the scattering matrix or the gradient.As illustrated in Fig. 1d, we do so by dividing the singular-vector matrix and separating the single APF computation into N sub sub-APF computations, each operating on smaller matrices . This way, only 1/N sub of the unnecessary matrix [areas shaded in red, green, and blue in Fig. 1d] is computed, reducing computing time and memory usage.To minimize memory, one can choose N sub to reduce numel( S) ≈ [M in + (M p /N sub )] 2 of each sub-APF computation to the order of magnitude of nnz(A).One may merge the output of the N sub computations to obtain but there is no such need: we can directly apply Eq. ( 3) onto This formalism for computing the gradient of a multichannel objective function is very general.It applies to any dimension, polarization, discretization scheme, any type of input source profiles and output projection profiles, with any objective function and any set of design variables.
As a concrete example, we consider full-wave modeling of the TM fields of a 1200-ridge aperiodic metasurface in 2D with mirror symmetry regarding its central plane (W = 1200λ wide, L = 0.6λ thick, discretized with grid size ∆x = λ/40 where λ is the wavelength), computing its full transmission matrix with M in = M out = 2W/λ = 2400 plane-wave channels on each side and the gradient of the transmission matrix with respect to N p = 1200 parameters being the edge positions of the ridges within half of the metasurface.Here, nnz(A) ≈ 1.6 × 10 7 , and M p = 57600.Compared to the conventional adjoint method, APF with N sub = 15 reduces the gradient evaluation time from 1354 minutes to 9.5 minutes and the memory usage from 10 GiB to 7 GiB.Here, the conventional adjoint simulations are also performed with software MESTI [34], which is already optimized by using the efficient MUMPS package [35], utilizing the symmetry of the Maxwell operator matrix A and the sparsity of the input profiles.All computations run on a single core on an Intel Xeon 2640v4 node.Details are provided in Sec. 1 of the Supplement 1, Table S1 shows the breakdown of the computing times, and Fig. S1 plots the dependence on N sub .We have made our gradient computation code open-source [36], including both the APF version and the conventional adjoint version.

III. INVERSE DESIGN OF A BROAD-ANGLE METASURFACE BEAM SPLITTER
As an example, here we inverse design a 2D metasurface beam splitter for TM polarization, composed of ridges with different widths, shown in Fig. 2a.We want the metasurface to split the incident light equally into the ±1 diffraction orders for any incident angle θ in within a 2θ max in angular range.Such a broad-angle beam splitter can be used with a vertical-cavity surface-emitting laser (VCSEL) array and a microlens array as a dot projector to generate structured illumination useful for structured illumination microscopy [37,38], 3D endoscopy [39,40], and 3D sensing (e.g., facial recognition [41], motion detection [42]).The microlens array collimates the output from the VCSEL array, and the beam splitter increases the number of dots.VCSEL arrays are widely used for dot projectors due to their uniform intensity pattern, high power density, low cost, and simple packaging.However, existing beam splitters based on Dammann gratings [43,44] or metasurfaces [45][46][47] only operate on normal incident light and not the oblique incidence from the off-axis VCSEL units.
We let the metasurface be periodic with the width of one period being W = 24 µm, which couples transverse angular momenta k y differing by integer multiples of 2π/W .Within the width W , we place N = 80 amorphous-silicon ridges (n ridge = 3.70) with height h = 0.56 µm and varying widths and positions, sitting on a silica (n sub = 1.45) substrate and surrounded by air (n air = 1), as shown in Fig. 2b,c.The ridge height ensures a sufficient 2π range of phase shifts when varying the ridge width (Supplementary Sec. 2 and Fig. S2).Since the desired response is symmetric, we let the struc-ture be mirror symmetric with respect to a central plane at y = 0 (black dot-dashed line).The operating wavelength is λ = 940 nm, typical for VCSELs.The angular range is 2θ max in = 60 • , typical for dot projectors.To inverse design this broad-angle beam splitter, we minimize the following FoM nm,target )T * nm with * standing for complex conjugation.Here, nnz(A) ≈ 3.3 × 10 5 , and M p = 3840.Using the APF method above, each objective-plus-gradient evaluation with N sub = 3 takes 1.6 seconds while using 0.2 GiB of memory when running on one core.To validate that there is no mistake in our derivation and implementation, we show in Fig. S3 that the gradient computed with APF agrees with a brute-force finite-difference evaluation of the FoM in Eq. (5).
To minimize the FoM, we use gradient-based algorithms to update optimization variables p along the opposite direction of the gradient ∇ p f .The optimizations stop when the change of f is less than f abs tol = 10 −4 .Af- ter comparing four different algorithms (Supplementary Sec. 4 and Fig. S4), we choose the sequential least-squares quadratic programming (SLSQP) algorithm [48] implemented in the open-source package NLopt [49] since it typically converges the fastest and often to a lower local minimum.During the optimization, the separation between neighboring edges (both the ridge width and the spacing between ridges) is constrained to be at least 40 nm to ensure fabrication feasibility.
We find that randomly sampled configurations of the parameter p have a poor performance with the FoM narrowly distributed between 10 and 20, but SLSQP optimization using those configurations as the initial guess leads to a wide distribution of the optimized FoM (Fig. 3).Since this inverse-design problem is non-convex, there is a sensitive dependence on the initial guess.To find a good final design, we run SLSQP optimizations with 1000 different initial guesses.
Figure 2c,d shows the initial configuration, the final configuration, their corresponding transmission matrices, and the evolution of the FoM for the best case among these optimizations.The optimized metasurface exhibits uniform and near-perfect beam splitting for all incident angles within the 60 • angular range of interest.Here, |T nm | 2 averages to 0.4 at the ±1 diffraction orders and to 0.003 away from these angles.Supplementary Video 1 shows how the configuration and the transmission matrix evolve, and Table S2 lists the parameters of the final configuration.
In Sec. 6 and Figs.S6-S7 of the Supplement 1, we consider optimizations where we do not impose a mirror symmetry in the structure.While such a setup is more general, it has a larger-than-necessary design space and yields slightly less optimal results.

IV. OUTLOOK
Computer-automated design and discovery unlock numerous possibilities, and the formalism proposed here can be the enabling element for inverse design on a wide range of multi-channel optical systems mentioned in the introduction.Instead of the N sub matrix division employed here, future work may also explore computing the Schur complement of a rectangular augmented matrix to avoid computing the unnecessary matrix U T A −1 U. Advances in the computation method, together with open-source codes, can deliver the next generation of photonic devices.

Fast multi-channel inverse design through augmented partial factorization: Supplemental document SHIYU LI, HO-CHUN LIN, CHIA WEI HSU
This document provides supporting information to "Fast multi-channel inverse design through augmented partial factorization".It consists of six sections.Section 1 provides benchmarks on gradient computation using augmented partial factorization (APF) and conventional adjoint method.Section 2 gives the amplitude and phase maps of meta-atoms over a wide incident angular range.Section 3 compares the gradient obtained from APF and finite difference.Section 4 compares the performance of different optimization algorithms.In Sec. 5, we use an animation to show how the metasurface and its transmission matrix evolve as the optimization goes on.In Sec. 6, we design broad-angle beam splitters using metasurfaces without mirror symmetry.

BENCHMARKS ON GRADIENT COMPUTATION
As an example, here we compare the computing time and memory usage of f only, and f -plus-∇ p f computation for a 1-mmwide metasurface, using the augmented partial factorization (APF) [1] and conventional adjoint method.The metasurface (1200λ wide) contains 1200 amorphous-silicon ridges with height 0.6λ sitting on a silica substrate; it has mirror symmetry with respect to the central plane such that the N p = 1200 edge positions of ridges on the left-half side of the metasurface can be changed independently.We evaluate the full-wave responses with M in = M out = 2W/λ = 2400 inputs and outputs using the open-source software MESTI [2].For the APF method, we build matrices B and C using channels from both two sides of the metasurface and retrieve responses of desirable channels after the partial factorization to take full advantage of the symmetry of the augmented matrix K = [A, B; C, D] and reduce the computing time and memory usage.Note that here nnz(B) ≈ 2.3 × 10 8 exceeds nnz(A) ≈ 1.6 × 10 7 , thus we compress B and C to reduce the number of nonzero elements such that nnz(B) < nnz(A) and the computational cost does not grow with M in [1].Specifically, we pad 480 extra channels to the input/output profiles, weight them by the Hann window function, Fourier transform them to the spatial domain, and then truncate with a 3λ window.After performing the partial factorization, the output is decompressed by undoing the above process, with a negligible error of ∼ 10 −4 from compression.
For the conventional method, we factorize matrix A = LU, solve A −1 B m for each input B m and project it with C if only f is needed.To compute the gradient using adjoint method, we perform the forward simulation (solve A −1 B m ), and the successive adjoint simulation with source for each input B m .The gradient is obtained after a loop over all M in inputs: The gradient computation code with these two methods implemented is open-source on Ref. [3].All computations are performed using a single core on an Intel Xeon 2640v4 node.Each case is computed 10 times, with the average computing time and the maximal memory usage after subtracting the 0.9 GiB memory used by MATLAB R2022a recorded.As shown in Tab.S1, factorization and solving are the most time-consuming stages for the APF and conventional method, respectively.Extra timing needed to obtain f and ∇ p f from the full-wave simulations is counted in post-processing.APF outperforms the conventional method by ∼2000 fold in speed and ∼2.5 fold in memory for the f -only case, with the memory usage difference coming from the compression of B and C in APF.When gradient is involved, APF can still outperform the conventional adjoint method in both speed and memory by separating a single computation into N sub ones, as shown in Fig. S1 with N sub ≥ 3 limited by the memory we have access to.One can choose N sub to match their major concerning.As presented in Tab.S1, APF with N sub = 15 achieves a ∼150 times speed-up and a ∼30% memory usage reduction compared to the conventional adjoint method.

META-ATOMS OF THE METASURFACE
Figure S2 shows that ridges with height h = 0.56 µm can provide a 2π range of phase shift with high transmission over the 60 • angular range of interest by changing their widths.We obtain the response of each unit cell with full-wave simulations using APF method implemented in MESTI [2], adopting Bloch periodic boundary in y and 20 pixels of perfectly matched layers (PMLs) in z to mimic a periodic array of meta-atoms with a fixed periodicity of Λ = 300 nm.Then the phase and amplitude of the zeroth-order transmission coefficient are mapped out for different ridge widths and incident angles.

GRADIENT VALIDATION
To validate the gradient derivation and implementation, Figure S3 compares the gradient computed with APF to finite-difference evaluation for a metasurface composed of N = 80 ridges described in Fig. 2b,c of the main text, with the FoM f defined in Eq. ( 5).The transmission matrix T and FoM f are evaluated at p 0 and p 0 + ∆p, where ∆p is a random increment of design parameters.Taylor expansion shows that the l 2 -norm ||∆T − ∇ p T(p 0 + ∆p/2) • ∆p|| with ∆T = T(p 0 + ∆p) − T(p 0 ) should scale as O(||∆p|| 3 ) when ||∆p|| is small, if the gradient ∇ p T(p 0 + ∆p/2) is accurately computed; so does the difference  Different well-developed gradient-based algorithms can be used to optimize the structure parameters {p k } = {y 1 , • • • , y N } and minimize the FoM. Figure S4 compares how the FoM evolves as the optimization goes on using different algorithms, starting from the same initial guess.
An intuitive choice is to update p along the opposite direction of ∇ p f by an adjustable step as p n+1 = p n − γ n ∇ p f (p n ), where γ n is the learning rate at the n-th iteration; n represents the iteration number.The learning rate γ n is chosen using the "backtracking line search" method [4,5] so as not to overshoot (when γ n is too large) or to slow down convergence (when γ n is too small).Here in Fig. S4, we start from a relatively large estimate of the learning rate γ max = 2 × 10 −4 µm 2 for movement along the −∇ p f direction, and iteratively shrink the learning rate by a factor of τ = 1/2 until the Armijo condition [6] is satisfied: where c 1 = 10 −2 .We also try the sequential least-squares quadratic programming (SLSQP) algorithm [7] and method of moving asymptotes (MMA) [8] implemented in the open-source package NLopt [9], and interior point algorithm [10][11][12] implemented in the fmincon function in Matlab; all of them support inequality constraints.As shown in Fig. S4, SLSQP converges faster and to a lower level among all these algorithms, and thus is adopted for all optimizations in this work.
Note that although the gradient −∇ p f points out the update direction, one still needs intermediate steps to decide the proper update size.A large update may cause overshooting; this is what happens at some iterations for interior point algorithm, MMA and SLSQP in Fig. S4.In the backtracking line search algorithm, we skip intermediate steps and only record evaluations that reduce the FoM, so the curve keeps going down.

FIG. 1 .
FIG. 1. Schematics of using augmented partial factorization (APF) for (a) forward simulation and (b-d) gradient computation.(a) Illustration of S = CA −1 B − D, where the multi-channel response is encapsulated in a generalized scattering matrix S given in terms of the discretized Maxwell operator matrix A, the source profiles B, the projection profiles C, and the baseline D from the incident field.Dots represent nonzero elements of these sparse matrices, with colors corresponding to different regions of the simulation domain.PML: perfectly matched layer.(b) Illustration of Eqs.(2-3) for a single derivative: after decomposing ∂A/∂p k into U k Σ k U T k and regrouping the terms, CA −1 U k and U T k A −1 B can be computed to yield ∂S/∂p k .(c) Illustration of Eq. (4) for the full gradient: by combining U k for all parameters {p k } and augmenting them to B and C, a single APF computation can yield the complete gradient with respect to all parameters.(d) By dividing U and performing N sub separate APF computations (N sub = 3 in this illustration), we can obtain the same CA −1 B, CA −1 U, and U T A −1 B with less computing time and memory by evaluating only 1/N sub of the unnecessary matrix U T A −1 U and by storing smaller matrices CA −1 U (n) and U T (n) A −1 B in memory.

FIG. 2 .
FIG. 2. Inverse design of a broad-angle metasurface beam splitter.(a) The desired broad-angle beam-splitting response.(b) Parameter definition and mirror symmetry of the structure.(c) Metasurfaces before and after optimization.(d) Evolution of the FoM of Eq. (5) during optimization and the squared amplitude of the transmission matrices before and after optimization.

2 , ( 5 )
where T nm is the transmission coefficient from the mth incident angle to the n-th outgoing angle.Here,p = {p k } = {y 1 , • • • , y N }parameterize the two edge positions of the N/2 ridges within half a period of the metasurface [encircled by the red box in Fig. 2b], with {y N +1 , • • • , y 2N } = −{y N , • • • , y 1 } based on symmetry.The target transmission is T 2 nm,target = 0.5 at the ±1 diffraction orders and T 2 nm,target = 0 otherwise.Equation (5) sums over all incident angles within the angular range of interest [i.e., |k m y | < (2π/λ) sin θ max in with 2π/W spacing] and all outgoing angles.With W = 24 µm, we have M in = 25 input channels within the 60 • angular range and M out = 51 output channels.For this specific FoM, ∂f /∂p k = 0 and ∂f /∂T nm = 2(|T nm | 2 − T 2

Fig. S1 .
Fig. S1.(a) The size of matrix S compared to the number of nonzero elements in matrix A when one objective-plus-gradient APF computation is divided into N sub sub-ones for a 1-mm-wide metasurface.The (b) total computing time and (c) memory usage as a function of N sub . .

Fig. S4 .
Fig.S4.The FoM evolves as the optimization goes on using different algorithms.

Fig. S5 .
Fig. S5.One frame of Supplementary Video 1, which shows how the metasurface, its transmission matrix and the FoM evolve as the optimization goes on.(a) Metasurface and (c) transmission matrix |T| 2 at the 50-th iteration.(b) Evolution of the FoM f during the first 50 iterations.

Fig. S6 .Fig. S7 .
Fig. S6.Schematic of (a) a general metasurface without mirror symmetry.(b) Metasurfaces before and after optimization.(c) Evolution of the FoM f during optimization and the squared amplitude of transmission matrices before and after optimization.