Introducing SpectraFit: An Open-Source Tool for Interactive Spectral Analysis

In chemistry, analyzing spectra through peak fitting is a crucial task that helps scientists extract quantitative information about a sample’s chemical composition or electronic structure. To make this process more efficient, we have developed a new open-source software tool called SpectraFit. It allows users to perform quick data fitting using expressions of distribution and linear functions through the command-line interface (CLI) or a Jupyter Notebook, and it runs on Linux, Windows, and macOS, as well as in a Docker container. As part of our commitment to good scientific practice, we have introduced an output file-locking system to ensure the accuracy and consistency of information. This system collects the input data, the results, and the initial fitting model in a single file, promoting transparency, reproducibility, collaboration, and innovation. To demonstrate SpectraFit’s user-friendly interface and the advantages of its output file-locking system, we focus on a series of previously published iron–sulfur dimers and their XAS spectra. We show how to analyze the XAS spectra via the CLI and in a Jupyter Notebook by simultaneously fitting multiple data sets with SpectraFit. Additionally, we demonstrate how SpectraFit can be used as both a black-box and a white-box solution, allowing users to apply their own algorithms to engineer the data further. This publication, along with its Supporting Information and the Jupyter Notebook, serves as a tutorial that guides users through each step of the process. SpectraFit will streamline the peak-fitting process and provide a convenient, standardized platform for users to share fitting models, which we hope will improve transparency and reproducibility in the field of spectroscopy.


Prompt Output
The prompt outputs of Figure 2 and the relaxed model of Figure S1 are presented in Figures S4 and S5, respectively. As mentioned earlier, there may be instances where confidence intervals cannot be obtained despite successfully calculated uncertainties; hence, Figure S4 lacks confidence intervals. In the case of the proposed model of Figure S2, however, the corresponding table is highlighted in Figure S5, which is displayed in the user's terminal. The printout always includes six types of tables: descriptive statistics, fit statistics, variables with optional errors, correlation of each model component, general correlation of each fit variable, and the regression metric. Additionally, the confidence intervals can be provided as an optional table in the terminal.

Correlation
The linear correlation is a helpful technique for SpectraFit users, as it enables them to answer fundamental questions on a qualitative level without repeatedly performing the fitting process. By analyzing the correlation matrix generated via the linear correlation technique, SpectraFit users can avoid trial and error and promptly arrive at a reasonable model.
The linear correlation is provided by the Pandas library,2 which employs a statistical technique to measure the strength and direction of a linear relationship between two variables. This technique is particularly useful for tabulated data, as it helps to assess how closely the values of one variable correspond to those of another. The linear (Pearson) correlation coefficient r is calculated as

r = [n Σ(xy) − (Σx)(Σy)] / √{[n Σx² − (Σx)²] · [n Σy² − (Σy)²]}

Here, n denotes the number of data points, and Σ represents the sum over all values. The linear correlation coefficient ranges from −1 to +1. A value of r = +1 signifies a perfect positive linear relationship, −1 indicates a perfect negative linear relationship, and 0 implies no linear relationship.
In order to calculate the linear correlation, the fit results are used, which are stored internally as a Pandas DataFrame.2 The following columns are used as 1D arrays:

1. Energy
2. Intensity (spectrum)
3. Residual (difference between intensity and fit)
4. Fit (sum of all single optimized distributions)
5. Single optimized distributions (each as a single column)

These are automatically processed and converted into a correlation matrix, which is presented as a tabulated DataFrame. By analyzing this data qualitatively and answering a few basic questions, further insight into the data can be gained, as provided in Table S1.
Table S1. Response table for basic questions about the correlation between variables and their importance.

Finally, the user should monitor for any outliers in the correlations that may indicate unusual behavior in specific regions or conditions.
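As a sketch of how such a correlation matrix can be reproduced outside of SpectraFit, the following builds a DataFrame with the columns listed above and calls Pandas' built-in `DataFrame.corr()`. The column names and the synthetic two-peak spectrum are illustrative, not SpectraFit's internal naming.

```python
import numpy as np
import pandas as pd

# Synthetic fit result: two Gaussian-like components plus noise (illustrative).
rng = np.random.default_rng(seed=0)
energy = np.linspace(700.0, 730.0, 200)
peak_1 = np.exp(-((energy - 710.0) ** 2) / 2.0)
peak_2 = 0.5 * np.exp(-((energy - 720.0) ** 2) / 8.0)
fit = peak_1 + peak_2
intensity = fit + rng.normal(scale=0.01, size=energy.size)

df = pd.DataFrame(
    {
        "energy": energy,
        "intensity": intensity,
        "residual": intensity - fit,
        "fit": fit,
        "peak_1": peak_1,
        "peak_2": peak_2,
    }
)

# Pearson correlation matrix over all 1D columns, as Pandas computes it.
corr = df.corr()
print(corr.round(2))
```

A quick qualitative reading of such a matrix answers the questions from Table S1, e.g. whether a single component already tracks the full fit, or whether two components are strongly anticorrelated.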

Uncertainties and Confidence Interval
SpectraFit attempts to calculate the covariance matrix by default for the Trust Region algorithm provided by lmfit.3 However, the matrix can be empty, making it impossible to estimate uncertainties. This can occur if the model is overly constrained and, during optimization, one or more parameters are limited to a boundary region. In such cases, even a single boundary hit can result in an empty covariance matrix.
It is important to take the definition of the model into account when calculating uncertainties and displaying uncertainty bars. For Gaussian and Cumulative Gaussian models, the height is defined as the amplitude divided by the broadening. Therefore, both types of errors, amplitude and Full Width at Half Maximum (FWHM), should be considered when calculating uncertainties via standard error propagation:

σ_h = h · √[(σ_A / A)² + (σ_FWHM / FWHM)²]

To understand uncertainties and confidence intervals in SpectraFit, it is important to start with the basics. Uncertainty refers to the amount of error or variance in an estimate,4 while confidence intervals5 provide a range of values within which the true value is likely to fall. To accurately estimate uncertainties and calculate confidence intervals, a model with no constraints is required; we refer to such a model as a relaxed model, as shown in Figure S3. Using this model, precise insight into the estimates can be obtained and the confidence intervals can be calculated accurately.
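The height propagation for a peak defined as amplitude over broadening can be sketched as a small helper, assuming uncorrelated amplitude and FWHM errors; the function name and example values are hypothetical, not SpectraFit's API.

```python
import math

def height_uncertainty(amplitude, sigma_amplitude, fwhm, sigma_fwhm):
    """Propagate amplitude and FWHM errors to the peak height h = A / FWHM.

    First-order error propagation for uncorrelated errors:
    sigma_h = h * sqrt((sigma_A / A)**2 + (sigma_FWHM / FWHM)**2)
    """
    height = amplitude / fwhm
    relative_error = math.hypot(sigma_amplitude / amplitude, sigma_fwhm / fwhm)
    return height, height * relative_error

# Illustrative values: A = 2.0 +/- 0.1, FWHM = 0.5 +/- 0.05
h, sigma_h = height_uncertainty(2.0, 0.1, 0.5, 0.05)
print(h, sigma_h)  # h = 4.0, sigma_h ~ 0.447
```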
When it comes to parameter estimation, the covariance matrix4 is the tool for quantifying the uncertainties associated with the estimated parameters; currently, only the Levenberg–Marquardt and Trust Region algorithms are supported in SpectraFit. The process begins with estimating the model parameters θ. It involves computing the residuals r, representing the differences between the observed values y and the model predictions f(x, θ). The Jacobian matrix J, derived from the partial derivatives of the residuals with respect to the parameters, is crucial in capturing the sensitivity of the residuals to parameter changes. Next, the covariance matrix Cov(θ̂) is calculated from the Jacobian matrix as

Cov(θ̂) = σ² (Jᵀ J)⁻¹

where σ² is the variance of the residuals. This matrix provides a comprehensive view of the uncertainties in the parameter estimates. The diagonal elements indicate the variances of the individual parameters, signifying the magnitude of the uncertainties; larger values suggest higher uncertainties (Figure S3). The off-diagonal elements represent the covariances between pairs of parameters, revealing the extent to which changes in one parameter correlate with changes in another. The standard errors SE(θ̂ᵢ) = √Cov(θ̂)ᵢᵢ, derived as the square roots of the diagonal elements, quantify the uncertainty associated with each parameter estimate. These standard errors become crucial in constructing confidence intervals for the parameters, providing a range within which the true parameter values are likely to lie with a certain level of confidence.
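This procedure can be sketched with plain numpy for a simple straight-line model, where the Jacobian of the residuals is known in closed form; the model and values are illustrative, not SpectraFit's internals.

```python
import numpy as np

# Synthetic data: y = 1.5 + 0.8 * x plus noise (illustrative).
rng = np.random.default_rng(seed=1)
x = np.linspace(0.0, 10.0, 50)
y = 1.5 + 0.8 * x + rng.normal(scale=0.2, size=x.size)

# Jacobian of the model f(x, theta) = theta_0 + theta_1 * x
# with respect to (theta_0, theta_1).
J = np.column_stack([np.ones_like(x), x])
theta, *_ = np.linalg.lstsq(J, y, rcond=None)

residuals = y - J @ theta
n, p = J.shape
s2 = residuals @ residuals / (n - p)    # variance of the residuals
cov = s2 * np.linalg.inv(J.T @ J)       # Cov(theta_hat) = s2 * (J^T J)^-1
std_errors = np.sqrt(np.diag(cov))      # per-parameter standard errors
print(theta, std_errors)
```

The diagonal of `cov` gives the parameter variances; the off-diagonal element quantifies how intercept and slope estimates co-vary.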
It is important to know that the covariance matrix not only provides insights into the individual uncertainties of parameters but also captures the interdependencies between parameters, offering a comprehensive understanding of the uncertainties inherent in the parameter estimation process.
After obtaining the covariance matrix, the next step is to carry out an F-test.6 This test measures the overall significance of the model by comparing the mean square of the residuals to the mean square of the model. The F-test results guide the construction of the confidence intervals.
For each parameter, the standard error is multiplied by a critical value, yielding a margin of error m = t_{α/2, n−p} · SE(θ̂ᵢ), where t_{α/2, n−p} is the critical value from the t-distribution with n − p degrees of freedom (n is the sample size, p is the number of parameters). This margin is added to and subtracted from the parameter estimate, establishing the lower and upper bounds of the confidence interval, which is finally defined as CI = [θ̂ᵢ − m, θ̂ᵢ + m]. SpectraFit allows this to be calculated via lmfit; however, the reader should be aware that there is no guarantee of obtaining these values if the optimization does not converge. Furthermore, infinite values are possible (Table S2) if the standard error is already very small and the probing creates a division by zero.
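The margin-of-error construction can be sketched with `scipy.stats.t`; the function name and values are hypothetical, and lmfit's own confidence-interval machinery is more elaborate.

```python
import numpy as np
from scipy import stats

def confidence_interval(theta_hat, std_err, n, p, level=0.95):
    """CI_i = [theta_i - m, theta_i + m] with m = t_{alpha/2, n-p} * SE(theta_i)."""
    alpha = 1.0 - level
    t_crit = stats.t.ppf(1.0 - alpha / 2.0, df=n - p)  # critical t value
    margin = t_crit * np.asarray(std_err)
    theta_hat = np.asarray(theta_hat)
    return theta_hat - margin, theta_hat + margin

# Illustrative estimates and standard errors for n = 50 points, p = 2 parameters.
lower, upper = confidence_interval([1.5, 0.8], [0.05, 0.01], n=50, p=2)
print(lower, upper)
```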
Expressing these intervals in terms of sigma levels provides an intuitive measure of confidence, as shown in Figure S4. A 1σ interval, corresponding to one standard deviation, conveys a 68.27% confidence level. Expanding to 2σ and 3σ intervals widens the range, offering 95.45% and 99.73% confidence levels, respectively.
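These coverage levels follow directly from the normal distribution and can be checked with the Python standard library alone:

```python
import math

# Two-sided coverage of a k-sigma interval under a normal distribution:
# P(|Z| <= k) = erf(k / sqrt(2)).
coverage = {k: math.erf(k / math.sqrt(2.0)) for k in (1, 2, 3)}
for k, c in coverage.items():
    print(f"{k} sigma -> {100 * c:.2f}% confidence")
# 1 sigma -> 68.27%, 2 sigma -> 95.45%, 3 sigma -> 99.73%
```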
When interpreting the results, wider intervals suggest higher uncertainties, while narrower intervals signal more precise parameter estimates. By integrating information from the covariance matrix with the insights derived from the F-test, confidence intervals can be obtained that offer a robust and interpretable framework for understanding the uncertainties associated with least-squares parameter estimates. Corresponding tables for the figures are also provided.

Changing the Type of Metric

About `DescriptionAPI` and `*.lock` files

As per the Good Scientific Practice Guidelines 11 (Methods and Standards) and 12 (Documentation) of the Deutsche Forschungsgemeinschaft (DFG),7 we have implemented the `DescriptionAPI` via Pydantic,8 which allows native export to Python dictionaries and, from there, to the JSON or TOML file format, which can be saved as a `*.lock` file. The `DescriptionAPI` is automatically activated, and it captures significant information such as the hash value of the host, the version of SpectraFit, and the versions of major libraries.
Additionally, it enables users to store different types of metadata, including refs, `projectDescription`, or free-field definitions such as `meta_data`; see Figure S6 for more information. Metadata can include details such as the sample, detection, endstation, calibration, or even a "Notice to the Reader", which can also be found at the end of the manuscript and serves to guide the data's use for scientific purposes. In general, metadata plays a crucial role in providing context for future analysis and keeping track of the data's suitability.
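The kind of record the `DescriptionAPI` captures can be illustrated with a dependency-free stdlib analogue; in SpectraFit itself this is a Pydantic model, and all field names and values below are illustrative.

```python
import dataclasses
import hashlib
import json
import platform

# Stdlib sketch of a DescriptionAPI-like record: host hash, version,
# and free-form metadata, exportable to JSON for a *.lock file.
@dataclasses.dataclass
class FitDescription:
    spectrafit_version: str
    host_hash: str
    metadata: dict

    def to_json(self) -> str:
        # Pydantic offers dict/JSON export natively; with dataclasses
        # we go via asdict().
        return json.dumps(dataclasses.asdict(self), indent=2)

desc = FitDescription(
    spectrafit_version="1.0.0",
    host_hash=hashlib.sha256(platform.node().encode()).hexdigest(),
    metadata={
        "projectDescription": "XAS peak fit",
        "meta_data": {"sample": "Fe-S dimer"},
    },
)
lock_text = desc.to_json()  # content that could be written to a *.lock file
print(lock_text)
```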
In summary, the SpectraFitNotebook object helps users organize their digital work, which is defined by the data itself and the description of the data, i.e., the metadata. This allows users to provide all necessary data by simply inspecting the `*.lock` file. Our primary objective is to treat the fitting process as a scientific sub-project with the intention to publish. To achieve this, we have automated the tracking of crucial information, which avoids any need for data reengineering or reconstruction. The comparison between the measured spectrum y and the predicted spectrum ŷ is the basis that SpectraFit uses for quantitative evaluation through the various scikit-learn9 scoring metrics implemented within SpectraFit.
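For self-containment, the following computes two such metrics with plain numpy; scikit-learn exposes the same quantities as `sklearn.metrics.mean_squared_error` and `sklearn.metrics.r2_score`. The spectrum values are illustrative.

```python
import numpy as np

# y is the measured spectrum, y_hat the fitted (predicted) one.
y = np.array([0.1, 0.4, 0.9, 0.7, 0.3])
y_hat = np.array([0.12, 0.38, 0.88, 0.72, 0.31])

# Mean Square Error: average squared deviation between data and fit.
mse = np.mean((y - y_hat) ** 2)

# Coefficient of determination R^2: fraction of the spectrum's variance
# explained by the fit.
r2 = 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - np.mean(y)) ** 2)
print(f"MSE = {mse:.5f}, R^2 = {r2:.4f}")
```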

Figure S1.
Figure S1. The complete CLI prompt of Figure 5, which is displayed in the user's terminal. The printout always includes six types of tables: descriptive statistics, fit statistics, variables with optional errors, correlation of each model component, general correlation of each fit variable, and the regression metric. Additionally, the confidence intervals can be provided as an optional table in the terminal.

Figure S2.
Figure S2. Terminal output as a confidence table for the relaxed model of Figure S4.

Figure S3.
Figure S3. Relaxed fit with uncertainty bars. Adapted with permission from Ref. 1. Copyright 2023 American Chemical Society.

Figure S5.
Figure S5. On the top, Plot A displays the metrics as a series of runs, with bar plots for the Akaike and Bayesian Information Criteria and a line plot for the Mean Square Error. In the middle, Plot B shows the corresponding command to switch, with one execution, to the new metric for bar and line. On the bottom, Plot C displays the results as a bar plot for the Mean Square Error and reduced Chi-Square and a line plot for R².

Figure S6.
Figure S6. Code capture of the input data for the `DescriptionAPI`.
R² = 1 − Σᵢ (yᵢ − ŷᵢ)² / Σᵢ (yᵢ − ȳ)², where yᵢ represents the actual spectrum values, ŷᵢ represents the predicted spectrum values, and ȳ represents the mean of the actual spectrum values.

Figure S8.
Figure S8. Reference code to generate a True vs. Predicted plot via the Seaborn library and save the result as a PDF.

Table S2.
Confidence interval table for the lower and upper sigma levels from 1σ to 3σ for the relaxed model.

Table S3.
The corresponding uncertainties of each component of complex 1 for Figure 7.

Table S6.
List of uncertainties for the simultaneous fit of Figure 11.