MolBook UNIPI—Create, Manage, Analyze, and Share Your Chemical Data for Free

Here, we present MolBook UNIPI, freely available and user-friendly software specifically designed for medicinal chemists as a powerful tool for the easy management of virtual libraries of chemical compounds. With MolBook UNIPI, it is possible to create, store, handle, and share molecular databases in a very simple and intuitive way. The software allows users to rapidly generate libraries of bioactive ligands, building blocks, or commercial compounds by either manually creating single molecules or automatically importing compounds from public databases and pre-existing libraries. MolBook UNIPI databases can be enriched with all kinds of data and can be filtered based on molecular structures or properties, allowing the desired molecules, along with their structures and features, to be easily accessible in just a few clicks. Moreover, new molecular properties and potential toxicological effects of compounds can be rapidly and reliably predicted. Notably, all of these functions can be easily mastered even by inexperienced users, with no prior cheminformatics knowledge or programming skills, which makes MolBook UNIPI an invaluable tool for medicinal chemists. MolBook UNIPI can be downloaded free of charge from the project web page https://molbook.farm.unipi.it/.


Table of Contents
Figure S3. Calculation and prediction of properties. Figure S3A represents the VenomPred tool for toxicity predictions, while Figure S3B shows a molecule entry with several computed properties.
Figure S4. Property query functionality. Application of an example query for searching for compounds on the basis of molecular weight, LogP and chemical formula.
Figure S5. Structural query functionality. Application of an example query for searching for compounds presenting a coumarin core by employing the substructure option.

Case C: Filtering of natural compounds
From a medicinal chemistry point of view, evaluating the structure of molecules with a view to future pharmaceutical development is essential. This evaluation includes analyses to identify and avoid compounds that may interfere with their biological assessment, while prioritizing those with promising potential to become candidate drugs. In this context, we propose the analysis of a set of 1000 compounds obtained from the public COCONUT 1 database in order to obtain molecules with a suitable drug-like structure. For this purpose, the property query functionality can be used (Figure S4). In particular, the query will be applied to the property called "Drug-likeness (RO5)" to retrieve compounds with the "Passed" flag.
MolBook UNIPI also includes the functionality to evaluate whether a molecule could generate interference when subjected to biological assays. This analysis is based on a search for substructures reported as potential causes of pan-assay interference. Compounds presenting such substructures, which could thus behave as pan-assay interference compounds (PAINS), 2 can be identified using the built-in "Calculate PAINS" tool present in the database menu under PAINS filter. The corresponding window allows the user to evaluate the molecules according to three different filters that can be run simultaneously. The result of each filter is stored in a newly generated column of the main table and reports the number of interfering substructures detected in each molecule.
Examining the molecular moieties that generated a PAINS alert is essential in order to understand the results. For this purpose, the PAINS mapping tool provides a grid view interface where the user can select the PAINS filter to highlight the matching substructures that have generated alerts in the analyzed molecules. The tool allows users to switch between the PAINS A, B and C filters, highlighting the different groups of PAINS substructures with different colors (see Figure S6).
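As an illustration of the filtering described above, the following minimal sketch annotates a small set of SMILES with a rule-of-five flag and with the number of PAINS A, B and C matches, and then keeps only drug-like compounds without alerts. It is not the MolBook UNIPI source code. It assumes that the PAINS catalogs shipped with RDKit are a reasonable stand-in for the built-in filters, and that "Passed" means that all four Lipinski limits are satisfied; the "Failed" label, the PAINS column names and all function names are hypothetical.

```python
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors
from rdkit.Chem.FilterCatalog import FilterCatalog, FilterCatalogParams

# One catalog per PAINS family, so that each family gets its own column.
# Column names are illustrative, not the ones used by MolBook UNIPI.
PAINS_FAMILIES = {
    "PAINS_A": FilterCatalogParams.FilterCatalogs.PAINS_A,
    "PAINS_B": FilterCatalogParams.FilterCatalogs.PAINS_B,
    "PAINS_C": FilterCatalogParams.FilterCatalogs.PAINS_C,
}
RO5_COLUMN = "Drug-likeness (RO5)"


def build_catalogs():
    """Build one RDKit FilterCatalog for each PAINS family."""
    catalogs = {}
    for name, family in PAINS_FAMILIES.items():
        params = FilterCatalogParams()
        params.AddCatalog(family)
        catalogs[name] = FilterCatalog(params)
    return catalogs


def passes_ro5(mol):
    """Assumed criterion: all four Lipinski limits are satisfied."""
    return (
        Descriptors.MolWt(mol) <= 500
        and Descriptors.MolLogP(mol) <= 5
        and Descriptors.NumHDonors(mol) <= 5
        and Descriptors.NumHAcceptors(mol) <= 10
    )


def annotate(smiles_list):
    """Return a table with the RO5 flag and the PAINS match counts."""
    catalogs = build_catalogs()
    rows = []
    for smiles in smiles_list:
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:  # skip SMILES that RDKit cannot parse
            continue
        row = {
            "SMILES": smiles,
            RO5_COLUMN: "Passed" if passes_ro5(mol) else "Failed",
        }
        for name, catalog in catalogs.items():
            # Number of alerting substructures found for this family.
            row[name] = len(catalog.GetMatches(mol))
        rows.append(row)
    return pd.DataFrame(rows, columns=["SMILES", RO5_COLUMN, *PAINS_FAMILIES])


def drug_like_without_alerts(table):
    """Keep rows flagged "Passed" that have no PAINS match in any family."""
    no_alerts = (table[list(PAINS_FAMILIES)] == 0).all(axis=1)
    return table[(table[RO5_COLUMN] == "Passed") & no_alerts]
```

Keeping one count column per PAINS family mirrors the description of the main table given above, and makes it possible to inspect which family raised an alert before discarding a compound.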
Furthermore, the structures selected in the grid view interface are automatically selected in the project table, making it easier for the user to export or delete only the desired entries.
conversions into the specified file formats during data import/export. Internally, MolBook UNIPI uses a pandas dataframe as a container to hold molecule data. A pandas dataframe is easy to handle and can be used directly with the PyQt5 library functions. The software is equipped with an error-handling system that prevents it from crashing due to bugs present in the source code and allows it to keep running properly even after an error event. In addition, the user can report the error shown within the pop-up window (Figure S7), which is displayed in case of errors, to our team using the form available on the official MolBook UNIPI website.

Chemical structure processing
MolBook UNIPI relies on the RDKit library for loading and handling molecular structures. In particular, the software accepts input structures as SDF or CSV/XLSX files, the latter containing a column with SMILES notations (see Data management for more details). These data are loaded with RDKit's dedicated function, which internally generates the molecular graph containing all the information on the atoms of the compounds. In all cases, the compounds are converted into RDKit mol objects and subjected to a "sanitization" protocol. 3 This procedure is often used to ensure the consistency of the chemical structure in order to avoid atoms with undesirable hybridizations or unusual bonds. The result of the sanitization is a molecular graph that can be represented with Lewis dot structures complete with octets. The steps involved in the protocol and their description are reported in Table S1.

Data management
The working system of MolBook UNIPI is based on the creation of projects, which are graphically displayed as tabs, each hosting a database of molecules.
Users can switch among different projects while keeping data separate and performing several tasks, which are described below. Each individual project can be saved to a path specified by the user. Specifically, a MolBook UNIPI project is stored as a folder including all information needed for consultation. Saved projects can be either accessed with the incorporated project load function, which requires manual selection of the project folder by the user, or directly loaded using the recent projects submenu. The compound data are displayed in a table that allows the user to quickly consult the chemical structure and associated information. In particular, each row in the table is associated with a molecule with an identifier (ID) that must be unique, while the structure of the molecule is stored as a 2D representation encoded by SMILES notation. Two options are available for displaying the data: the "Classic" view, which exhibits one molecule structure at a time, and the "Image
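The combination of a pandas dataframe, unique identifiers, SMILES-encoded structures and RDKit sanitization described in the two sections above can be sketched as follows. This is only an illustration under stated assumptions, not the actual MolBook UNIPI implementation: the column names "ID", "SMILES" and "Canonical" and all function names are hypothetical, and compounds that fail sanitization are simply dropped here, which may differ from how the software handles them.

```python
import pandas as pd
from rdkit import Chem


def load_table(records):
    """Build a project-like table from (ID, SMILES) pairs.

    Raises ValueError if an identifier appears more than once.
    """
    table = pd.DataFrame(records, columns=["ID", "SMILES"])
    if table["ID"].duplicated().any():
        raise ValueError("molecule identifiers must be unique")
    return table


def sanitized_smiles(smiles):
    """Return a canonical SMILES for a sanitized molecule, or None.

    The molecule is first parsed without sanitization, then sanitized
    explicitly so that parsing and chemistry errors are handled apart.
    """
    mol = Chem.MolFromSmiles(smiles, sanitize=False)
    if mol is None:  # the string is not valid SMILES syntax
        return None
    try:
        Chem.SanitizeMol(mol)  # valence, aromaticity, kekulization checks
    except ValueError:  # RDKit sanitization errors derive from ValueError
        return None
    return Chem.MolToSmiles(mol)


def add_structures(table):
    """Add a canonical SMILES column and drop rows that failed sanitization."""
    table = table.copy()
    table["Canonical"] = [sanitized_smiles(s) for s in table["SMILES"]]
    return table[table["Canonical"].notna()]
```

For example, a row containing a carbon atom with five bonds would be parsed as a graph but rejected by the sanitization step, so it would not appear in the resulting table.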

Toxicity predictions
The VenomPred platform for in silico toxicity predictions, recently developed by our team, 7 was integrated into the software. The platform employs machine learning (ML) models trained with experimentally evaluated toxicological data to predict the potential toxicity of chemical compounds in relation to four endpoints: mutagenicity, carcinogenicity, hepatotoxicity and estrogenicity. The training and test set compounds, used respectively to train and evaluate the models for all endpoints, were retrieved from VEGA, a freely available toxicity assessment software. The compounds were converted into molecular fingerprints to provide binary vectors suitable for ML model fitting. For each endpoint, the compounds were represented by 5 different chemical fingerprints (FPs). The FPs were then combined with four ML algorithms, yielding 20 different models per endpoint, whose hyperparameters were optimized. The models were subjected to internal cross-validation and external test set validation. For each endpoint, our models achieved better or comparable performance with respect to the reference models included in VEGA. In order to improve the performance of the ML models, we applied a consensus strategy that combined the predictions of multiple models. This strategy achieved a higher predictive performance; therefore, the best model combination for each endpoint was included in the VenomPred platform.
VenomPred returns a probability value in the range between 0 and 100 indicating the potential toxicity of small molecules in relation to a specific endpoint. A compound is classified as toxic if the probability is equal to or greater than 50, while a probability below 50 indicates a non-toxic profile. More precisely, a value closer to 100 corresponds to a highly confident prediction of potential toxicity. Similarly, a probability close to 0 represents a non-toxic prediction with high confidence.
Note that the speed of the toxicological predictions performed by VenomPred through MolBook UNIPI depends on the hardware specifications of the computer running the software.
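The way a consensus probability is turned into a toxic or non-toxic label can be illustrated with a short sketch. The 50 cut-off and the 0 to 100 scale come from the description above. How VenomPred actually combines the individual models is not detailed here, so the plain average used below is an assumption made only for illustration, and the function names are hypothetical.

```python
from statistics import mean

# Cut-off stated in the text: probabilities >= 50 are classified as toxic.
TOXIC_THRESHOLD = 50.0


def consensus_probability(model_probabilities):
    """Combine per-model probabilities (0-100) into a single value.

    Assumption: a simple arithmetic mean; the actual VenomPred
    consensus scheme may differ.
    """
    if not model_probabilities:
        raise ValueError("at least one model prediction is required")
    for value in model_probabilities:
        if not 0 <= value <= 100:
            raise ValueError("probabilities must lie between 0 and 100")
    return mean(model_probabilities)


def classify(probability, threshold=TOXIC_THRESHOLD):
    """Label a compound as toxic when the probability reaches the threshold."""
    return "toxic" if probability >= threshold else "non-toxic"
```

With this scheme, three hypothetical models returning 80, 60 and 40 would give a consensus of 60 and therefore a toxic label, while a value just below 50 would be labeled non-toxic.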

Query search
MolBook UNIPI projects can be easily queried to retrieve entries of molecules that have certain properties calculated by the software or added by the user, as well as to obtain compounds that are structurally similar to a molecule defined as a query. The "Property Query" widget allows the user to define several queries that are applied to filter compounds and to identify only those whose properties satisfy all the criteria specified by the user. A query on a single property is defined by three parameters: the property of the molecule to be examined, the comparison criterion, and the query value. The results are displayed in a new project that preserves the properties associated with each molecule retrieved through the query. A project can also be queried in terms of chemical structure matching with the "Structural Query" widget. By using this widget, it is possible to draw, using the JSME sketcher, the chemical structure of a molecule that is used as a reference (query) to perform the comparison with the compounds in the project. Three search approaches are available for the "Structural Query": similarity, substructure and superstructure. The similarity approach is based on converting the compounds to be examined into a chemical fingerprint. MolBook UNIPI employs the RDKit library function to calculate Morgan fingerprints, 8 setting the vector length to 2048 bits and the atom radius to 2. Fingerprint similarity is calculated with the Tanimoto index, 9 which returns a value between 0 and 1, where 1 indicates that the two fingerprints are identical. A vertical slider is present in the "Structural Query" widget for setting the similarity threshold, which is shown in percentage values. The substructure approach relies on searching for the presence of the query molecule in the chemical structure of the compounds in the project.
On the other hand, the superstructure search method identifies molecules whose structure represents a part of the structure specified as a query. Both methods are performed through the built-in methods of RDKit molecule objects. Analogously to the "Property Query," the results of the "Structural Query" are displayed as new projects browsable in the corresponding tabs of the main software window.
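The three structural search modes can be sketched with RDKit as shown below. The fingerprint settings (Morgan, radius 2, 2048 bits) and the Tanimoto index are taken from the description above. Everything else is an assumption made for illustration: the function names, the dictionary-based library, and the default threshold of 70% are hypothetical, and the code is not the MolBook UNIPI implementation.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator

# Morgan fingerprints with the settings reported in the text.
_MORGAN = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)


def tanimoto(mol_a, mol_b):
    """Tanimoto index (0 to 1) between the Morgan fingerprints of two molecules."""
    fp_a = _MORGAN.GetFingerprint(mol_a)
    fp_b = _MORGAN.GetFingerprint(mol_b)
    return DataStructs.TanimotoSimilarity(fp_a, fp_b)


def structural_query(library, query_smiles, mode="similarity", threshold_percent=70):
    """Return the IDs of library molecules matching the query.

    library: dict mapping a molecule ID to its SMILES (hypothetical layout).
    mode: "similarity", "substructure" or "superstructure".
    threshold_percent: similarity cut-off, expressed as a percentage.
    """
    if mode not in ("similarity", "substructure", "superstructure"):
        raise ValueError(f"unknown search mode: {mode}")
    query = Chem.MolFromSmiles(query_smiles)
    if query is None:
        raise ValueError("the query SMILES could not be parsed")

    hits = []
    for mol_id, smiles in library.items():
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:  # skip entries that RDKit cannot parse
            continue
        if mode == "similarity":
            keep = 100 * tanimoto(mol, query) >= threshold_percent
        elif mode == "substructure":
            keep = mol.HasSubstructMatch(query)  # query contained in molecule
        else:
            keep = query.HasSubstructMatch(mol)  # molecule contained in query
        if keep:
            hits.append(mol_id)
    return hits
```

Note that the substructure and superstructure modes use the same RDKit method with the roles of query and library molecule swapped, which reflects the description of the superstructure search as the reverse of the substructure search.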