RamanSPy: An Open-Source Python Package for Integrative Raman Spectroscopy Data Analysis

Raman spectroscopy is a nondestructive and label-free chemical analysis technique, which plays a key role in the analysis and discovery cycle of various branches of science. Nonetheless, progress in Raman spectroscopic analysis is still impeded by the lack of software, methodological and data standardization, and the ensuing fragmentation and lack of reproducibility of analysis workflows thereof. To address these issues, we introduce RamanSPy, an open-source Python package for Raman spectroscopic research and analysis. RamanSPy provides a comprehensive library of tools for spectroscopic analysis that supports day-to-day tasks, integrative analyses, the development of methods and protocols, and the integration of advanced data analytics. RamanSPy is modular and open source, not tied to a particular technology or data format, and can be readily interfaced with the burgeoning ecosystem for data science, statistical analysis, and machine learning in Python. RamanSPy is hosted at https://github.com/barahona-research-group/RamanSPy, supplemented with extended online documentation, available at https://ramanspy.readthedocs.io, that includes tutorials, example applications, and details about the real-world research applications presented in this paper.


RamanSPy
Raman spectroscopy (RS) is a powerful sensing modality based on inelastic light scattering, which provides qualitative and quantitative chemical analysis with high sensitivity and specificity [1]. RS yields a characterisation of the vibrational profile of molecules, which can help elucidate the composition of chemical compounds, biological specimens and materials [2][3][4]. In contrast to most conventional technologies for (bio)chemical characterisation (e.g., staining, different omics, fluorescence microscopy and mass spectrometry), RS is both label-free and non-destructive, thereby allowing the acquisition of rich biological and chemical information without compromising the structural and functional integrity of the probed samples. This advantage has enabled a broad range of applications of RS in biomedical and pharmaceutical research [5][6][7], including in the imaging of cells and tissues [8][9][10][11], the chemical analysis of drug compounds [12,13], and the detection of disease [14][15][16][17].
An area of topical interest is the frontier of Raman spectroscopy, chemometrics and artificial intelligence (AI), with its promise of more autonomous, flexible and data-driven RS analytics [18][19][20]. There has been a recent surge in the adoption of AI methods in Raman-based research [4], with applications to RS now spanning domains as broad as the identification of pathogens and other microbes [21][22][23][24]; the characterisation of chemicals, including minerals [25], pesticides [26] and other analytes [27,28]; the development of novel diagnostic platforms [29][30][31][32]; as well as the application of techniques from computer vision for denoising and super-resolution in Raman imaging [33].
As new hardware, software and data acquisition RS technologies continue to emerge [34,35], there is a pressing need for an integrated RS data analysis environment, which facilitates the development of pipelines, methods and applications, and bolsters the use of RS in biomedical research. Yet, the full deployment of RS and its capabilities is still hindered by practical factors stemming from the restrictive, functionally disparate, and highly encapsulated nature of current commercial software for RS data analysis. RS data analysis often operates within proprietary software environments and data formats, which have induced methodological inconsistencies and reduced cross-platform and benchmarking efforts, with growing concerns around reproducibility. These restrictions have also hampered the adoption of new AI technologies into the field [36][37][38][39][40]. As a consequence, researchers increasingly resort to developing in-house scripts for RS analysis in Python [41], further adding to methodological fragmentation and lack of standardisation [42].
In response to these challenges, we have developed RamanSPy, a modular, open-source framework for integrated Raman spectroscopy analytics in Python. RamanSPy is designed to systematise day-to-day workflows, enhance algorithmic development and validation, and accelerate the adoption of novel AI technologies into the RS field. Firstly, RamanSPy serves as a platform for general-purpose RS analytics supporting the RS data life cycle by providing a suite of ready-to-use modules for data loading, preprocessing, analysis and visualisation. By design, these functionalities are not tied to any specific technology or data type, thereby allowing integrative and transferable cross-platform analyses. Secondly, RamanSPy addresses challenges in data preprocessing by facilitating the compilation of reproducible pipelines to streamline and automatise preprocessing protocols. Thirdly, RamanSPy helps bridge the gap between RS data and state-of-the-art AI technologies within the extensive machine learning (ML) ecosystem in Python. Complemented by direct access to Raman datasets, preprocessing protocols and performance metrics, this provides the foundation for AI model development and benchmarking.
Fig. 1 General Raman spectroscopy workflow and core features of RamanSPy. a, RamanSPy supports the Raman spectroscopic data analysis life cycle via a modular, loosely coupled architecture. RS data is parsed to a common data representation format, which is interfaced with preprocessing, analysis and visualisation tools within RamanSPy. The core features of RamanSPy include a comprehensive library of standardised, simple-to-use procedures for data loading, preprocessing, analysis and visualisation. These modules are flexible and allow the incorporation of further techniques and in-house methods. For complete information about the modules available in RamanSPy, refer to the documentation at https://ramanspy.readthedocs.io. b, An example workflow use case in RamanSPy: Raman data is loaded, preprocessed and analysed in a few lines of code.
The codebase of RamanSPy is hosted at https://github.com/barahona-research-group/RamanSPy with extended documentation (https://ramanspy.readthedocs.io), which includes tutorials and example applications, and details about the real-world research applications presented in this paper.

Results
RamanSPy as a platform for general Raman spectroscopy analytics. RamanSPy is based on a modular, object-oriented programming (OOP) infrastructure, which streamlines the RS data analysis life cycle (Fig. 1a) and allows users to compile diverse analysis workflows with a few lines of reusable, user-friendly code (Fig. 1b). The framework adopts a scalable array-based data representation, which accommodates different spectroscopic modalities, including single-point spectra, Raman imaging data, and volumetric scans.
Experimental data can be loaded through custom loaders built into RamanSPy or through standard tools available in Python. The data representation functions as a common data container that defines the interface between RS data management and manipulation within RamanSPy, allowing us to unify data standards across setups and vendors, independent of instrumental origin and acquisition modality.
RamanSPy also provides an extensive toolbox for preprocessing, analysis and visualisation. The preprocessing suite includes techniques for denoising, baseline correction, cosmic spike removal, normalisation and background subtraction, among others. Likewise, the analysis toolbox includes modules for decomposition (useful for dimensionality reduction), clustering and spectral unmixing. RamanSPy also includes a set of data visualisation tools. All these modules are organised into an extensible class structure, which standardises their application across projects and datasets to facilitate transferable analysis workflows.
We showcase the core features of RamanSPy by analysing volumetric Raman spectroscopic data from a human leukaemia monocytic (THP-1) cell [9] (Fig. 2). The aim is to investigate the cell phenotype in a label-free manner using RS and methods from chemometrics. We load the data using built-in tools, and apply a spectral preprocessing protocol comprising spectral cropping to the fingerprint region (700-1800 cm⁻¹), cosmic spike removal, denoising, baseline correction and normalisation (see SI). Using the visualisation tools in the package, we inspect data quality (Fig. 2b) and perform initial exploratory analysis by examining, e.g., data slices across wavenumber bands (Fig. 2c). The analysis proceeds to spectral unmixing based on: (i) N-FINDR [43] for endmember detection, and (ii) fully constrained least squares (FCLS) [44] for component quantification. This process is used to demix signal contributions from different cellular components and study their morphological organisation within the THP-1 cell. Following the peak assignment in [9], we distinguish endmember components related to lipids (bands 1066, 1134, 1303, 1443 and 1747 cm⁻¹), nucleic acid (band 789 cm⁻¹), cytoplasm (band 1008 cm⁻¹, characteristic of proteins), and the background (Fig. 2e). Finally, we produce fractional abundance reconstructions based on the extracted endmembers, which we can examine on a single-layer level (Fig. 2f) and across the entire volume (Fig. 2g) to localise cellular organelles within the cell.
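As a compact reminder of what this two-step unmixing computes, the linear mixing model behind it can be written as below. The notation (a measured spectrum $\mathbf{x}$, $K$ endmember spectra $\mathbf{e}_k$ collected as the columns of $\mathbf{E}$, and fractional abundances $a_k$) is introduced here for illustration only and is not taken from [43,44] or from the RamanSPy documentation.

```latex
\mathbf{x} \;\approx\; \sum_{k=1}^{K} a_k\,\mathbf{e}_k \;=\; \mathbf{E}\,\mathbf{a},
\qquad a_k \ge 0, \qquad \sum_{k=1}^{K} a_k = 1,

\hat{\mathbf{a}} \;=\; \arg\min_{\mathbf{a}} \;\lVert \mathbf{x} - \mathbf{E}\,\mathbf{a} \rVert_2^{2}
\quad \text{subject to} \quad \mathbf{a} \ge \mathbf{0}, \;\; \mathbf{1}^{\top}\mathbf{a} = 1 .
```

In this reading, N-FINDR supplies $\mathbf{E}$ by selecting, among the observed spectra, the vertices of the simplex of largest volume, and FCLS solves the constrained least-squares problem for every spectrum in the volume. In the THP-1 analysis above, $K = 4$ (lipids, nucleic acid, cytoplasm and background).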
RamanSPy enables automated pipelining of spectral preprocessing protocols. Experimental RS data is susceptible to non-specific signal artefacts (e.g., cosmic rays, autofluorescence background, variability in instrumentation), which can severely affect downstream analyses. Preprocessing is therefore a critical step in any spectroscopic analysis workflow [45,46]. Yet, due to a lack of standardisation and frameworks for general-purpose pipelining [36], researchers tend to utilise variable preprocessing protocols, often dispersed across different software systems, thus affecting reproducibility and validation [47,48].
To facilitate the creation of reproducible protocols, RamanSPy incorporates a pipelining infrastructure, which systematises the process of creating, customising and executing preprocessing pipelines (Fig. 3a). Users can use a specialised class, which defines a generic, multi-layered preprocessing procedure, to assemble pipelines from selected built-in preprocessing modules or other in-house methods. To reduce overhead, the constructed pipelines are designed to function exactly as any single method, i.e., they are fully compatible with the rest of the modules and data structures in the package. Furthermore, pipelines can be easily saved, reused and shared to foster the development of a repository of preprocessing protocols. As a seed to this repository, RamanSPy provides a library of assembled preprocessing protocols (custom pre-defined, or adapted from the literature [49]), which users can access and exploit.
To illustrate the pipelining functionalities, we use RamanSPy to construct three preprocessing protocols by compiling selected methods in the desired order of execution, and applying them out-of-the-box to data loaded into the platform (Fig. 3c-e). We use them to preprocess Raman spectroscopic data from [9] (Fig. 3b). Note how the three pipelines yield substantially different results, reinforcing the importance of consistency in the selection of preprocessing protocols. Pipeline II was deemed the most robust, and consequently added to the protocols library in RamanSPy as default.
RamanSPy facilitates AI integration and validation of next-generation Raman data analytics. To help accelerate the adoption of AI technologies for RS analysis, RamanSPy is endowed with a permeable architecture, which streamlines the interface between Raman spectroscopic data and the burgeoning ML ecosystem in Python. This is complemented by tools for benchmarking, such as datasets and performance metrics, which support the evaluation of new models and algorithms. We show below two examples of RamanSPy's capabilities for ML integration and benchmarking.
First, RamanSPy allows the seamless integration of standard Python AI/ML methods (e.g., from scikit-learn [53], PyTorch [54] and TensorFlow [55]) as tools for RS analysis (Fig. 4a). As an illustration, we use RamanSPy to construct a deep learning denoising procedure based on the one-dimensional ResUNet model, a fully convolutional UNet neural network with residual connections [33]. To do this, we simply wrap within RamanSPy the pre-trained neural network (trained on spectra from MDA-MB-231 breast cancer cells, available at https://github.com/conor-horgan/DeepeR) as a custom denoising method. Once wrapped, the denoiser is automatically compatible with the rest of RamanSPy and can be readily employed for different applications. For instance, we replicate the results in [33], and show in Fig. 4b-c that this deep-learning denoiser, applied to the low signal-to-noise ratio (SNR) test set from [33], consistently outperforms the commonly-used Savitzky-Golay filter [56], as quantified by various metrics also coded within RamanSPy (e.g., mean squared error (MSE), spectral angle distance (SAD) [57] and spectral information divergence (SID) [58]). Applying this pipeline to new data only involves changing the data source. Taking advantage of this transferability, we test the denoiser on unseen volumetric Raman data from another cell line (THP-1 [9]), with added Gaussian noise (see SI). In this case, Fig. 4d-e shows improved performance especially according to the MSE metric, which is dependent on normalisation, but with lower significance according to scale-invariant and information-theoretic metrics, also available in RamanSPy. This example emphasises the importance of incorporating robust validation criteria within data analysis workflows.
Secondly, the data management backbone of RamanSPy ensures a direct data flow to the rest of the Python ecosystem, i.e., data can be loaded, preprocessed, and analysed in RamanSPy and then exported to conduct further modelling and analysis elsewhere (Fig. 5a). As an example application, we perform AI-based bacteria identification using Raman measurements [21] from 30 bacterial and yeast isolates (Fig. 5b). After loading and exploring the spectra with RamanSPy, we interface the data with the lazypredict Python package [60] and benchmark 28 different ML classification models (including logistic regression, support vector machines and decision trees) on the task of predicting the species from the spectrum. The models were trained on a high-SNR dataset (100 spectra per isolate) and tested on an unseen high-SNR testing set of the same size. Our benchmarking analysis in Fig. 5c identifies logistic regression as the best-performing model, achieving a classification accuracy of 79.63% on the species-level classification task (Fig. 5d), and 94.63% for antibiotic treatment classification (Fig. 5e).
To further assist validation against previous results, RamanSPy provides access to a library of curated datasets, which can be integrated into analysis and benchmarking workflows. This lays the foundation for a common repository of RS data and reduces barriers to data access, especially for ML teams with limited access to RS instruments [19]. The dataset library in RamanSPy already includes data loaders for Raman data from bacterial species [21], cell lines [9,33], COVID-19 samples [61,62], multi-instrument Surface Enhanced Raman Spectroscopy (SERS) measurements of adenine samples [63], wheat lines [64] and minerals [65], and will continue to be expanded.

Discussion
In this paper, we have introduced RamanSPy, a computational framework for integrative Raman spectroscopic data analysis. RamanSPy offers a comprehensive collection of tools for spectroscopic analysis designed to systematise the RS data analysis life cycle, reducing typical overheads of analysis workflows and improving methodological standardisation. The package also lays the foundations of a common repository of standardised methods, protocols and datasets, which users can readily access and exploit within the RamanSPy framework to conduct different benchmarking studies. Furthermore, RamanSPy is fully compatible with frameworks for data science and machine learning in Python, thereby facilitating the adoption and validation of advanced AI technologies for next-generation RS analysis. Lastly, we remark that, while our focus here has been on Raman spectroscopy, many of the tools in RamanSPy are of broad applicability to other vibrational spectroscopy techniques, including infrared (IR) spectroscopy.

Methods

Installation
RamanSPy has been deposited in the Python Package Index (https://pypi.org/project/ramanspy). This means it can be installed directly with pip, the common package installer for Python (pip install ramanspy). To access the functionalities of the package after installation, users only need to import RamanSPy in their Python scripts. One can import the whole package (import ramanspy, optionally as rp) or individual modules and methods (e.g., from ramanspy import load, preprocessing).

Core infrastructure
Data management. Data in RamanSPy is represented by a set of custom data container classes based on scalable, computationally efficient array programming [66], which correspond to different spectroscopic modalities. This includes the generic SpectralContainer class, as well as the more specialised Spectrum, SpectralImage and SpectralVolume classes representing single-point spectra (1D), imaging data (3D) and volumetric data (4D), respectively. These classes define data-specific information and behaviour in the background to allow a smooth, user-friendly experience, regardless of the data of interest.
The containers can be initialised by providing the corresponding intensity data, the spectral axis (in cm⁻¹) and other relevant (meta) data, which will become properties of the constructed object. For instance:

    raman_spectrum = ramanspy.Spectrum(intensity_data, spectral_axis, *args, **kwargs)
    raman_image = ramanspy.SpectralImage(intensity_data, spectral_axis, *args, **kwargs)

Once created, data containers can be manipulated, visualised, saved and loaded as needed using the built-in tools in RamanSPy.
Note that for the most part, users would not need to manually populate these containers. Instead, they can take advantage of the data loading functionalities that RamanSPy provides.
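To make the idea of a container that couples intensity values with their spectral axis concrete, the following is a minimal, dependency-free sketch. The class name SimpleSpectrum and its band() helper are hypothetical and exist only for this illustration. They are not RamanSPy's Spectrum class, which is array-based and considerably richer.

```python
class SimpleSpectrum:
    """Hypothetical stand-in for a 1D spectral container (not RamanSPy's Spectrum)."""

    def __init__(self, intensity, spectral_axis):
        # Each intensity value must correspond to exactly one wavenumber.
        if len(intensity) != len(spectral_axis):
            raise ValueError("intensity and spectral_axis must have the same length")
        self.intensity = list(intensity)
        self.spectral_axis = list(spectral_axis)  # wavenumbers in cm^-1

    def band(self, wavenumber):
        """Return the intensity at the axis point closest to `wavenumber`."""
        closest = min(
            range(len(self.spectral_axis)),
            key=lambda i: abs(self.spectral_axis[i] - wavenumber),
        )
        return self.intensity[closest]


# Three points spanning the fingerprint region, with a peak near 1008 cm^-1.
spectrum = SimpleSpectrum([0.1, 0.9, 0.3], [700.0, 1008.0, 1800.0])
peak_value = spectrum.band(1000.0)
```

Keeping the axis inside the object is what lets downstream methods (cropping, band selection, plotting) work without the user passing wavenumbers around separately.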
Data loading. To support data loading, RamanSPy offers easy-to-use data loaders compatible with experimental Raman spectroscopic data from a range of instrument vendors. These loaders, available within ramanspy.load, automatically parse relevant data files and return the appropriate spectral container. As an example, users can load MATLAB files exported from WITec's ProjectFOUR/FIVE software using the following command:

    raman_object = ramanspy.load.witec(<PATH>)

A full list of the data loaders built into RamanSPy is available as part of the documentation of the package at https://ramanspy.readthedocs.io/en/latest/loading.html.
Raman data can also be loaded via established data-loading tools in Python. For instance, one can use pandas' csv loader to load a spectrum from a .csv file with two columns storing the intensity data and the spectral axis by using:

    import pandas as pd
    import ramanspy

    data = pd.read_csv(csv_filename)
    raman_spectrum = ramanspy.Spectrum(data["<intensity_column>"], data["<axis_column>"])

Spectral preprocessing. Preprocessing logic in RamanSPy is defined by the PreprocessingStep class, which defines most of the necessary preprocessing infrastructure in the background to ensure a smooth, data-agnostic experience via a single point of contact, the apply() method. Yet, as with data loading, for the most part, users are not expected to use this class to manually implement and optimise such preprocessing methods themselves. Instead, the RamanSPy package provides a comprehensive toolbox of ready-to-use preprocessing methods, which users can access, customise and employ to compile a wide variety of preprocessing procedures. These preprocessing procedures are given as predefined classes within ramanspy.preprocessing which extend the PreprocessingStep class. To use these built-in methods, users need to create an instance of the selected technique. Note that RamanSPy offers full control over relevant parameters, which can be supplied during initialisation via the *args and **kwargs arguments.
As the methods inherit all operational logic defined within the parent PreprocessingStep class, they can be directly accessed and used on any data loaded in the framework through their apply() method, which accepts a spectral object or a collection of spectral objects. A full list of the methods for spectral preprocessing built into RamanSPy is available as part of the documentation of the package at https://ramanspy.readthedocs.io/en/latest/preprocessing.html.
If needed, users can also incorporate any in-house method into RamanSPy by manually creating instances of the PreprocessingStep class which wrap the given method. This is done by passing a function, which takes the intensity data and the spectral axis and returns their updated versions, to ramanspy.preprocessing.PreprocessingStep together with any relevant *args and **kwargs. The custom preprocessing method is then fully compatible with the rest of RamanSPy's functionalities and out-of-the-box applicable to any data integrated within the package via its apply() method. Note that this class structure implies that these instances can then be saved (e.g., as pickle files) and, therefore, reused and shared as required afterwards.
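The wrapping pattern described above can be illustrated with a small, dependency-free sketch. The names WrappedStep and scale_to_max are hypothetical and operate on plain Python lists. RamanSPy's PreprocessingStep is not reproduced here and may differ in its internals. The sketch only shows how a function and its parameters can be bound once and then reached through a single apply() entry point.

```python
class WrappedStep:
    """Hypothetical wrapper: binds a function and its parameters behind apply()."""

    def __init__(self, func, *args, **kwargs):
        self.func = func
        self.args = args
        self.kwargs = kwargs

    def apply(self, intensity, axis):
        # The wrapped function receives the data first, then the bound parameters.
        return self.func(intensity, axis, *self.args, **self.kwargs)


def scale_to_max(intensity, axis, target=1.0):
    """Rescale a spectrum so that its largest value equals `target`."""
    peak = max(intensity)
    return [target * value / peak for value in intensity], axis


# Parameters are fixed at construction time; only data is passed to apply().
normaliser = WrappedStep(scale_to_max, target=100.0)
scaled_intensity, scaled_axis = normaliser.apply([1.0, 2.0, 4.0], [700, 1000, 1800])
```

Because the parameters live on the instance rather than in the calling code, the same object can be reused across datasets without repeating its configuration.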
Spectral analysis. As with preprocessing classes, users can access any built-in analysis method (available within the ramanspy.analysis sub-module) by creating an object instance of the corresponding class (again, with full control over relevant parameters), e.g., ramanspy.analysis.cluster.KMeans(*args, **kwargs) or ramanspy.analysis.unmix.NFINDR(*args, **kwargs). Once created, instances can be similarly accessed via their apply() method on any data loaded in RamanSPy:

    cluster_maps, cluster_centres = kmeans.apply(<spectral object or collection of spectral objects>)
    abundance_fractions, endmembers = unmixer.apply(<spectral object or collection of spectral objects>)

A full list of the methods for spectral analysis built into RamanSPy is available as part of the documentation of the package at https://ramanspy.readthedocs.io/en/latest/analysis.html.

Preprocessing pipelines
Pipelining behaviour is defined by the Pipeline class in RamanSPy, which ensures that pipelines are accessible, simple-to-use and fully compatible with the rest of RamanSPy.
Creating a custom preprocessing pipeline. To assemble a preprocessing pipeline, one simply needs to stack relevant methods (built-in or custom) in the intended order of execution and pass them as a list to ramanspy.preprocessing.Pipeline. Constructed pipelines can then be applied exactly as single methods via their apply() method to any data loaded within RamanSPy. As pipelines in RamanSPy are objects, they can also be directly saved in a convenient file format, such as pickle files. As such, they can then be reloaded, reused and shared as needed.
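The effect of stacking methods in a given order can be seen in a short, dependency-free sketch. SimplePipeline, crop and min_max are hypothetical names written for this illustration and act on plain Python lists. They are not RamanSPy's Pipeline or its built-in methods, and steps are simplified to plain callables. The two pipelines below contain the same steps in opposite order and return different spectra, which mirrors the point made in the Results about consistency in preprocessing protocols.

```python
from functools import partial


def crop(intensity, axis, lower, upper):
    """Keep only points whose wavenumber lies in [lower, upper]."""
    kept = [(y, x) for y, x in zip(intensity, axis) if lower <= x <= upper]
    return [y for y, _ in kept], [x for _, x in kept]


def min_max(intensity, axis):
    """Rescale intensities to the range 0-1 (a flat spectrum maps to zeros)."""
    low, high = min(intensity), max(intensity)
    span = high - low
    if span == 0:
        return [0.0 for _ in intensity], axis
    return [(y - low) / span for y in intensity], axis


class SimplePipeline:
    """Hypothetical pipeline: runs callables in order behind a single apply()."""

    def __init__(self, steps):
        self.steps = list(steps)

    def apply(self, intensity, axis):
        for step in self.steps:
            intensity, axis = step(intensity, axis)
        return intensity, axis


crop_fingerprint = partial(crop, lower=700, upper=1800)
crop_then_normalise = SimplePipeline([crop_fingerprint, min_max])
normalise_then_crop = SimplePipeline([min_max, crop_fingerprint])

raw_intensity = [10.0, 2.0, 4.0, 6.0]
raw_axis = [500, 800, 1200, 1700]
result_a = crop_then_normalise.apply(raw_intensity, raw_axis)
result_b = normalise_then_crop.apply(raw_intensity, raw_axis)
```

Here the strong value at 500 cm⁻¹ lies outside the fingerprint region, so normalising before cropping compresses the remaining values, whereas cropping first lets them span the full 0-1 range.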
Access a predefined preprocessing pipeline. RamanSPy also provides a collection of built-in preprocessing pipelines. To access them, one can select the desired protocol from ramanspy.preprocessing.protocols as follows:

    preprocessing_pipeline = ramanspy.preprocessing.protocols.PROTOCOL_X

A pre-defined Pipeline instance will be returned, which can similarly be employed directly through its apply() method.
A full list of the protocols for spectral preprocessing built into RamanSPy is available as part of the documentation of the package at https://ramanspy.readthedocs.io/en/latest/preprocessing.html#established-protocols.

AI integration
Integrate AI methods into RamanSPy. To integrate new techniques for spectral preprocessing and analysis, users can take advantage of the extensible architecture of RamanSPy and wrap models and algorithms into custom classes. For instance, one can create a new denoiser method based on a PyTorch model for denoising by simply creating a function, which defines how the model can be used to preprocess a generic intensity data array, and then wrapping the method within a PreprocessingStep instance.

    import numpy as np
    import torch
    import ramanspy

    # `model` is assumed to be a pre-trained PyTorch denoising network that is already loaded
    def nn_preprocessing(intensity_data, wavenumber_axis):
        # flatten all non-spectral dimensions so that each row is one spectrum
        flat_intensity_data = intensity_data.reshape(-1, intensity_data.shape[-1])
        # add a channel dimension, run the model and convert the result to a NumPy array
        output = model(torch.Tensor(flat_intensity_data).unsqueeze(1)).cpu().detach().numpy()
        # drop the channel dimension and restore the original shape
        output = np.squeeze(output).reshape(intensity_data.shape)
        return output, wavenumber_axis

    nn_denoiser = ramanspy.preprocessing.PreprocessingStep(nn_preprocessing)

Integrated methods are automatically rendered fully compatible with the rest of RamanSPy's functionalities in the background, so one can simply use the apply() method of the constructed denoiser to preprocess any data loaded within RamanSPy as any built-in preprocessing class.
Export data from RamanSPy to AI frameworks. The data management core of RamanSPy allows a direct interface with the entire Python ecosystem, including frameworks for statistical modelling, machine learning and deep learning. To do that, users can simply feed relevant data from RamanSPy to functions and tools they want to use elsewhere. For instance, one can pass the intensity data stored in a spectral container to a specific model from the scikit-learn [53] framework for statistical and ML modelling directly via its fit() method:

    model.fit(spectral_container.spectral_data)

Datasets
To access the Raman spectroscopic datasets available in RamanSPy, users can employ custom data-loading methods built into RamanSPy under ramanspy.datasets. These would automatically parse the relevant data into the corresponding spectral container. For instance, one can load the bacteria data from [21] using the following function:

    data_container, labels = ramanspy.datasets.bacteria(dataset="train", <PATH>)

Note that, depending on where each dataset was deposited and the license it was deposited under, some of these methods will automatically download the given dataset, whereas others may require the manual download of the data. Users are pointed to the documentation of each method for instructions on how to properly load each dataset.
A full list of the datasets built into RamanSPy is available as part of the documentation of the package at https://ramanspy.readthedocs.io/en/latest/datasets.html.

Metrics
Users can likewise readily access relevant spectroscopic metrics, such as MSE, SAD and SID, from ramanspy.metrics. These can be used to measure the similarity between spectra by calling the respective method.
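For readers who want to see what these three metrics measure, the following dependency-free sketch implements their standard textbook definitions on plain Python lists. The function names mse, sad and sid are chosen here for illustration. They are not the ramanspy.metrics API, whose signatures and conventions (e.g., normalisation or logarithm base) may differ. The example compares a spectrum with a copy of itself scaled by a factor of two, which shows why MSE depends on normalisation whereas SAD and SID do not.

```python
import math


def mse(x, y):
    """Mean squared error between two spectra of equal length."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


def sad(x, y):
    """Spectral angle distance: the angle (in radians) between two spectra."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    # Clamp to [-1, 1] to guard against floating-point round-off.
    return math.acos(max(-1.0, min(1.0, dot / norm)))


def sid(x, y):
    """Spectral information divergence (symmetrised relative entropy).

    Assumes strictly positive intensities, since each spectrum is
    normalised to sum to one and passed through a logarithm.
    """
    total_x, total_y = sum(x), sum(y)
    p = [a / total_x for a in x]
    q = [b / total_y for b in y]
    return sum(
        pi * math.log(pi / qi) + qi * math.log(qi / pi) for pi, qi in zip(p, q)
    )


reference = [1.0, 2.0, 3.0]
scaled = [2.0 * value for value in reference]  # same shape, twice the intensity

mse_value = mse(reference, scaled)  # sensitive to the change in scale
sad_value = sad(reference, scaled)  # depends on spectral shape only
sid_value = sid(reference, scaled)  # depends on spectral shape only
```

Reporting a scale-dependent and a scale-invariant metric side by side, as done for the denoising comparison in Fig. 4, helps to separate genuine changes in spectral shape from differences in overall intensity.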

Fig. 2 Morphological analysis of a THP-1 cell via spectral unmixing with RamanSPy. a, Bright-field image of a THP-1 cell. The same cell was also imaged with Raman spectroscopy. Image and volumetric Raman data from [9]. b, An exemplar spectrum from the raw volumetric Raman data (taken from the centre of the layer in d). The fingerprint region (700-1800 cm⁻¹) shaded in red was used for the analysis. c, Volumetric data at the 1008 cm⁻¹ band (characteristic of proteins) after preprocessing. d-g, Spectral unmixing analysis reveals the distribution of components within the cell: lipids (violet), nucleus (blue), cytoplasm (green), and background (yellow). d, A merged reconstruction of the sixth depth layer (10 in total) of the THP-1 cell determined via spectral unmixing. e, Four endmembers derived with N-FINDR [43] characterised via peak assignment. f, Fractional abundance maps calculated with FCLS [44] for the sixth depth layer. g, Fractional abundance maps for the entire volume.

Fig. 3 Spectral preprocessing pipelining in RamanSPy. a, RamanSPy automates the construction, customisation and execution of multi-layered preprocessing procedures via pipelining. Users can assemble built-in and in-house methods into complete preprocessing pipelines, which are fully compatible with data integrated within RamanSPy and can be saved, reused and shared. RamanSPy also provides access to a library of already assembled preprocessing pipelines. b, Two raw spectra from the THP-1 data from [9] are used to compare the effect of different preprocessing pipelines. c-e, The results of three preprocessing pipelines built within RamanSPy, demonstrating the need for standardisation. Note on preprocessing methods: fingerprint region is 700-1800 cm⁻¹; ASLS - Asymmetric Least Squares [50]; asPLS - Adaptive Smoothness Penalized Least Squares [51]; AUC - area under the curve; cosmic rays removed with algorithm from [52].

Fig. 4 RamanSPy interfaces with AI/ML Python frameworks to create new methods for RS analysis. a, RamanSPy allows users to incorporate AI/ML models seamlessly into pipelines created within the platform. b-c, A pre-trained 1D ResUNet deep-learning denoiser [33] is integrated as a preprocessing module within RamanSPy to investigate its performance against the Savitzky-Golay (SG) filter [56]. b, Denoising of a spectrum from [33], where the low-SNR (purple) is the input and the high-SNR (green) is the target. The data is denoised with a SG filter of polynomial order 3 and kernel size 9, SG(3, 9) (blue), and with the implemented deep-learning denoiser (yellow). c, The results on the test set from [33] (n = 12694) show that the deep-learning denoiser outperforms six SG filters across three performance metrics (MSE, SAD, SID). Error bars represent one standard deviation around the sample mean. Statistical significance measured with a two-sided Wilcoxon signed-rank test with adjustment for multiple comparisons based on Benjamini-Hochberg correction [59] (* P < 0.05, ** P < 0.01, *** P < 0.001, **** P < 0.0001). d-e, Same analysis on unseen data from [9] (n = 1600). The input (purple) corresponds to data contaminated with added noise and the target (green) to the original data. In this case, the deep-learning denoiser only shows an improvement for MSE.

Fig. 5 RamanSPy as a suite for algorithmic development and benchmarking. a, Data representations in RamanSPy are compatible with the Python AI/ML ecosystem, allowing data flow from RamanSPy to scikit-learn [53], PyTorch [54], TensorFlow [55], etc. RamanSPy is also equipped with standard datasets and relevant metrics to support model development and validation. b-e, Benchmarking ML classification models on the task of bacteria identification from Raman spectra [21]. b, Mean Raman spectra for all bacterial species provided (100 spectra per species). Spectra are min-max normalised to the range 0-1 for visualisation purposes. c, Benchmarking results of 28 ML models. The best accuracy was achieved by the logistic regression classifier. d-e, Confusion matrices for the best species-level (d) and antibiotic-level (e) classifier, with accuracies of 79.63% and 94.63%, respectively.
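The min-max normalisation and the accuracy values read off the confusion matrices in this caption can be sketched as follows. This is a hedged NumPy illustration on toy labels, not the benchmarking code used in the paper; the helper names `min_max_normalise`, `confusion_matrix` and `accuracy` are hypothetical and introduced here only for the example.

```python
import numpy as np


def min_max_normalise(spectrum):
    """Rescale a spectrum linearly so its minimum maps to 0 and its maximum to 1."""
    spectrum = np.asarray(spectrum, dtype=float)
    lo, hi = spectrum.min(), spectrum.max()
    if hi == lo:
        # A flat spectrum has no range to rescale; return zeros.
        return np.zeros_like(spectrum)
    return (spectrum - lo) / (hi - lo)


def confusion_matrix(y_true, y_pred, n_classes):
    """Count matrix where rows index the true class and columns the predicted class."""
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        matrix[t, p] += 1
    return matrix


def accuracy(matrix):
    """Fraction of correctly classified samples: diagonal sum over total count."""
    return float(np.trace(matrix) / matrix.sum())


# Toy labels for three classes (not results from the paper).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

cm = confusion_matrix(y_true, y_pred, n_classes=3)
acc = accuracy(cm)
normalised = min_max_normalise([2.0, 4.0, 6.0])

print(cm)
print("accuracy:", acc)
print("normalised:", normalised)
```

The accuracy is the trace of the confusion matrix divided by the total number of samples, so a single matrix reports both per-class errors and the overall score.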
    import ramanspy
    # or
    import ramanspy as rp

or individual modules or methods:

    # individual modules
    from ramanspy import load, preprocessing
    # individual methods
    from ramanspy.analysis.unmix import NFINDR

For instance:

    denoiser = ramanspy.preprocessing.denoise.SavGol(*args, **kwargs)
    baseline_corrector = ramanspy.preprocessing.baseline.ASLS(*args, **kwargs)
    normaliser = ramanspy.preprocessing.normalise.MaxIntensity(*args, **kwargs)

    preprocessed_objects = denoiser.apply(<spectral object or collection of spectral objects>)
    preprocessed_objects = baseline_corrector.apply(<spectral object or collection of spectral objects>)

    def preprocessing_func(intensity_data, spectral_axis, *args, **kwargs):
        # Preprocess intensity_data and spectral_axis
        ...
        return updated_intensity_data, updated_spectral_axis

    # wrapping the function together with the relevant *args and **kwargs
    custom_preprocessing_method = ramanspy.preprocessing.PreprocessingStep(preprocessing_func, *args, **kwargs)
    custom_preprocessing_method.apply(<spectral object or collection of spectral objects>)

    nmf = ramanspy.analysis.decompose.NMF(*args, **kwargs)
    kmeans = ramanspy.analysis.cluster.KMeans(*args, **kwargs)
    unmixer = ramanspy.analysis.unmix.NFINDR(*args, **kwargs)

    preprocessing_pipeline = ramanspy.preprocessing.Pipeline([
        ramanspy.preprocessing.denoise.SavGol(*args, **kwargs),
        ramanspy.preprocessing.baseline.ASLS(*args, **kwargs),
        ramanspy.preprocessing.normalise.MaxIntensity(*args, **kwargs),
        custom_preprocessing_method(*args, **kwargs)  # custom in-house method
    ])
    preprocessed_objects = preprocessing_pipeline.apply(<spectral object or collection of spectral objects>)

    ramanspy.metrics.SID(spectrum_I, spectrum_II)

A full list of the metrics built into RamanSPy is available as part of the documentation of the package at https://ramanspy.readthedocs.io/en/latest/metrics.html.

D.G. is supported by UK Research and Innovation [UKRI Centre for Doctoral Training in AI for Healthcare grant number EP/S023283/1]. S.V.P. gratefully acknowledges support from the Independent Research Fund Denmark (0170-00011B). R.X. and M.M.S. acknowledge support from the Engineering and Physical Sciences Research Council (EP/P00114/1 and EP/T020792/1). A.F.G. acknowledges support from the Schmidt Science Fellows, in partnership with the Rhodes Trust. M.M.S. acknowledges support from the Royal Academy of Engineering Chair in Emerging Technologies award (CiET2021\94). M.B. acknowledges support by the EPSRC under grant EP/N014529/1, funding the EPSRC Centre for Mathematics of Precision Healthcare at Imperial College London, and under grant EP/T027258/1. The authors thank Dr Akemi Nogiwa Valdez for proofreading and data management support. Figures were created with BioRender (www.biorender.com).