PS2MS: A Deep Learning-Based Prediction System for Identifying New Psychoactive Substances Using Mass Spectrometry

The rapid proliferation of new psychoactive substances (NPS) poses significant challenges to conventional mass-spectrometry-based identification methods due to the absence of reference spectra for these emerging substances. This paper introduces PS2MS, an AI-powered predictive system designed specifically to address the limitations of identifying the emergence of unidentified novel illicit drugs. PS2MS builds a synthetic NPS database by enumerating feasible derivatives of known substances and uses deep learning to generate mass spectra and chemical fingerprints. When the mass spectrum of an analyte does not match any known reference, PS2MS simultaneously examines the chemical fingerprint and mass spectrum against the putative NPS database using integrated metrics to deduce possible identities. Experimental results affirm the effectiveness of PS2MS in identifying cathinone derivatives within real evidence specimens, signifying its potential for practical use in identifying emerging drugs of abuse for researchers and forensic experts.


Drug enumeration
To predict all potential emerging drugs of abuse, the aim is to employ the known fundamental structures of various drugs of abuse as input and perform chemical modifications without altering their primary structure.This approach is designed to prevent any changes in drug properties that could result in the loss of psychoactive effects.The molecular structure is represented using the SMILES format.To obtain the structural formula, we rely on RDKit 5 , an Open-Source Cheminformatics C++ Library that provides us with the necessary data structures, such as the molecular fingerprint and mass weight.The mass weight is used by the mass weight filter (mwf) in the ranking step to only consider derivatives that weigh within a given tolerance range (± 1) of the M+ peak of the spectrum of the analyte.
This study uses a systematic approach to enumerate derivatives based on the input drug structure by removing adjacent terminal hydrogen atoms and attaching functional groups at that site to achieve chemical modification of the drug.For the purpose of achieving a comprehensive listing of all potential emerging drugs of abuse, a collection of 204 known functional group structures was gathered.However, it was found that a single modification of the target drug was S 4   not sufficient to enumerate all possibilities for all drugs.Analyzing data from UNODC, which comprises over 200 cathinone-type listed drugs, reveals that most of these drugs necessitate three to four chemical modifications and encompass more than 30 functional groups.We found some of these listed drugs demand four rounds of functional group modifications, which include slight alterations to the carbon chain, such as adding ethyl groups.These extra functional groups can be readily synthesized and incorporated.Due to their simplicity in synthesis, they become attractive choices for drug synthesis steps.We tackled these modifications by adding additional functional groups -dimethyl, dibutyl, and dipropyl -were introduced to the functional group collection in the 3 rd recursion.These supplementary functional groups, requiring two modifications each, were incorporated to prevent the omission of controlled derivative drugs.A recursive process was carried out three times to generate drug derivatives.
The 57 drug-like criteria 1 (see Table S2) are used to further eliminate derivatives.Critically for GC/MS, our drug-like filters address concerns related to thermal stability, volatility, and chemical inertness by implementing specific rules such as the exclusion of compounds prone to decomposition at high temperatures (Rule 7, C=N filter), avoidance of decarboxylation-prone substances (Rule 8, decarboxy filter), and filtering out large polycyclic compounds that might decompose (Rule 21, polycyclic filter).Additionally, rules targeting non-volatility, including restrictions on the number and size of rings (Rule 29 and Rule 30) and the number of rotatable bonds (Rule 50), contribute to refining compound selection for applications such as GC-MS.The consideration of chemical inertness is addressed through rules targeting potentially reactive functional groups (Rule 51) and bonds that might be reactive under GC conditions (Rule 37 and Rule 38).These drug-like filters collectively enhance the suitability and reliability of compounds in our synthetic NPS database for diverse analytical methodologies, including GC-MS.

Use SCScore to filter functional groups
To prevent the generation of derivatives that are impractical to synthesize, the SCScore model is utilized to screen functional groups.The "full_reaxys_model_1024bool" pre-trained model (downloaded from https://github.com/connorcoley/scscore/tree/master/models) is used, with the SMILES of the compound serving as input to obtain the synthesis difficulty score of the compound.
To screen functional groups, n recursions of modification are initially performed for a functional group, potentially resulting in several derivatives due to various modification positions at each layer.For the modified derivatives of the nth recursion, individual synthesis difficulty is calculated using SCScore, and the average score is determined.Based on the synthesis difficulty of the target compound, if the average synthesis difficulty score is below (  , where is a user-adjustable hyperparameter, the functional group will be added to the ) + α  α functional group set of the nth recursion.In this paper, ɑ is set to 1.2.The functional group set of each recursion will be the cascade of the functional group sets below the layer.Note that all 204 functional groups are included in the first recursion regardless of their synthesis difficulty.

Candidate Selection
The ACSESS 1 algorithm's built-in screening rules are utilized, encompassing 40 rules for identifying unstable molecules and 17 rules for selecting drug molecules (see Table S2).This helps to reduce unnecessary computation and identify potential emerging drugs that can be prepared.The molecular structure screening method of ACSESS has been implemented using the RDKit library.RDKit was selected due to its cost-effectiveness for future use and promotion, as well as its modularization capabilities and ease of program acceleration compared to Python.

Training and Testing Datasets
In order to train the prediction model of mass spectrometry, the 2017 NIST Mass Spectral Library was used as training data and testing data.NIST data is collected by the National Institute of Standards and Technology and contains about 210,000 compound 2D structures (format in .SDF) and their corresponding mass spectra (format in .msp).In addition to the NIST dataset, the SWGDRUG 3.11 dataset, comprising approximately 3,500 mass spectra of known illicit drugs, was also employed.The mass spectrum data of each compound were normalized, so that the sum of intensities of each mass spectrum is 100.The aim is to avoid the different scales of mass spectrums which output by different mass spectrometers will affect the training accuracy.
Ninety percent of the NIST datasets were randomly chosen as the training set for subsequent mass spectrometry and fingerprint prediction models.The remaining 10% was reserved as the testing set for evaluating the model's performance.Additionally, the SWGDRUG dataset was split into two equal halves, with one half designated for testing and the other half combined with the NIST training set to create a new training set.
Prior to testing actual sample mass spectra, the dataset was restructured since some test subjects were already present in the SWGDRUG database.These test subjects were excluded from the training set and the model was retrained.
In search of the similarity threshold, RDKit was used to extract the substructure of Cathinone.Compounds in the SWGDRUG containing this substructure were identified as Cathinone-type drugs, while the remaining compounds were considered non-cathinone-type drugs.
One hundred instances were randomly selected for each category, and similarity calculations were performed with the enumerated databases separately.

Model training and hyperparameter setting
During NEIMS training, we developed our model following the guidelines provided on the NEIMS GitHub page, experimenting with various loss functions such as KL divergence, MSE, and crossentropy.Ultimately, we opted for MSE as the most suitable loss function.The mask in hyperparameters was set to false.Regarding DeepEI training, we followed the default parameters.
S 7 Despite attempting early stop due to the extended training time, it proved ineffective, leading us to abandon its use.

Resource and Time Consumption
The enumeration of 420 million derivatives with 15 threads took approximately 15 hours.A total of 90 minutes to perform synthetic difficulty calculations by SCScore to filter functional groups with 15 threads.The speedup is linear to the number of threads used and can be easily scaled up.
The training periods for NEIMS and DeepEI were 4.5 hours and 27 hours, respectively.Computing similarities and ranking for a tested analyte typically took around 1 minute.

Ranking
The similarity between the mass spectra of two molecules, A and B, is calculated using the cosine similarity calculation method, as employed in CFM-ID 6 .In cases where the mass spectrum of A contains N intensity values and that of B contains M intensity values, the two mass spectra are initially aligned through dynamic programming, resulting in the identification of Q pairs.and   represent the m/z ratio values for molecules A and B, respectively, while and denote the       intensities of compounds A and B at the respective m/z ratios, and .The calculation of cosine     similarity between two mass spectra is represented by Equation 1.
In the calculation of mass spectrum similarity, in addition to cosine similarity, Jaccard similarity (equation 2) is also calculated.j a and j b are respectively represented as the set of m/z ratios of the top 30 intensities of compounds a and b in the mass spectrum.
S 8 The similarity calculation of the mass spectrum is the geometric mean of cosine similarity and Jaccard similarity (Equation 3).
When comparing the similarity of fingerprints, Jaccard similarity (Equation 4) was used, where and represent the fingerprints of compounds a and b, respectively.
Finally, the weighted average score (Equation 5) was calculated by combining the mass spectrum score and the fingerprint score.This combined score, termed SMSF, represents the overall similarity between the analyte and the listed derivatives from the synthetic database.
The "ranking" for each compound in the testing set was determined by applying the mwf filters and assessing the similarity between each compound's real data and the predicted mass spectrum and fingerprint from the synthetic database model.The ranking indicates the similarity of a compound's real data to the predictions of the model in comparison to all the compounds in the synthetic database.The ranking proportion for the entire testing set was then determined.For instance, a ranking proportion of 0.9 for the NEIMS model ranking 100 indicates that 90% of the compounds in the testing set are ranked within the top 100 of the entire synthetic database based on the similarity between their real mass spectra and the one predicted by NEIMS.A higher ranking proportion suggests a better similarity between the predicted results of the model and the real data, and hence, better performance of the model.

SMSF thresholding
From the SWGDRUG dataset, we randomly selected 100 cathinone-like drugs and 100 noncathinone-like drugs.We then employed similarity calculations (specifically, SMSF) against the synthetic NPS database.For cathinone-like drugs, the resulting scores were used to establish a Gaussian distribution.In contrast, for non-cathinone-like drugs, we determined the highest similarity score found in the enumeration database to create a separate Gaussian distribution.The intersection point of these two Gaussian probability functions was defined as the threshold (see Fig. S3).
When assessing an unknown substance, this threshold was applied to evaluate the likelihood that the two substances are identical, even when the most similar compound in the database is identified.
S 10    Table S2.Triple bonds in ring The feature can significantly affect the chemical and physical properties of the compound, including its reactivity and stability.

4
Allene filter Designed to avoid increased strain and reactivity due to the presence of two adjacent double bonds.

5
Acid Taut filter Designed to avoid undergo rapid interconversion between different forms in solution, which may lead to potentially complicating analyses or reactions 6 Aminal filter See [3] for detail

7
C=N filter Designed to avoid compounds that may decompose at high temperatures.

8
Decarboxy filter Implemented to exclude compounds prone to decarboxylation.

9
Enamine filter See [3] for detail 10 Enol filter Designed to avoid the increased reactivity or instability due to the presence of a highly reactive double bond adjacent to a hydroxyl group.

11
FC filter See [3] for detail

12
Geminal filter Designed to avoid the increased reactivity or instability due to steric hindrance or electronic effects.

13
Hemiacetal filter Designed to avoid the increased reactivity or instability due to the presence of a potentially reactive hydroxyl group and an acetal-forming group.

14
Hemiaminal filter Designed to avoid the increased reactivity or instability due to the presence of a potentially reactive hydroxyl group and an amine-forming group.Het-Het filter Designed to avoid the increased reactivity or instability under certain conditions due to the presence of electronegative atoms such as oxygen, nitrogen, sulfur, or halogens.16 Het-Het-Het filter Designed to avoid the increased reactivity or instability under certain conditions due to the presence of electronegative atoms such as oxygen, nitrogen, sulfur, or halogens.17 Hetero-SR filter See [3]  Compounds containing chains of aromatic nitrogens tend to be structurally complex.
Figure S1.Methamphetamine, MDMA, MDEA, and 6-MAPDB belong to the amphetamine-type stimulant (ATS) class and can be produced by modifying the structure of Amphetamine.

Figure S2 .
Figure S2.The functional groups in the figure are modifications added to the structure of cathinone based on the list of recognized cathinone-type drugs.The average synthesis difficulty of these functional groups is below the threshold for their respective depth.This ensures that existing cathinone-type illicit drugs won't be overlooked due to SCScore filtering.

Figure S3. a
Figure S3. a Random selections were made from the SWGDRUG dataset, consisting of 100 Cathinone-type drugs and 100 other types of drugs.Subsequently, the matched similarity of Cathinonetype drugs and the top similarity score of other types of drugs were computed with our enumerated database.These score distributions were then mapped to Gaussian distributions, and the threshold was identified at the intersection of these two distributions, set at 0.55.b Utilizing diverse similarity scores as thresholds, true positive rates and false positive rates were calculated and presented as ROC curves.The red dot represents the threshold 0.55 obtained in a.

Figure S4. a
Figure S4. a The performance of real standard drugs not belonging to the cathinone-type illicit drugs was assessed in the cathinone enumeration database.The x-axis represents the similarity score, while the y-axis shows the names of the 20 illicit drugs.b Among the enumerated cathinones, the similarity score distributions for the not amphetamine-like drug MAPDB.The x-axis ranges from 0.5 to 0.65 in intervals of 0.0015.c The top four ranked compounds listed in the enumerated database are all isomers of MAPDB, as shown in their respective structures.
Table of Filter Rules 1-3