Goslin: A Grammar of Succinct Lipid Nomenclature

We introduce Goslin, a polyglot grammar for common lipid shorthand nomenclatures based on the LIPID MAPS nomenclature and the shorthand nomenclature established by Liebisch and coauthors and used by LipidHome and SwissLipids. Goslin was designed to address the following pressing issues in the lipidomics field: (1) to simplify the implementation of lipid name handling for developers of mass spectrometry-based lipidomics tools, (2) to offer a tool that unifies and normalizes the main existing lipid name dialects, enabling high-throughput lipidomics analysis, and (3) to provide a consistent mapping from lipid shorthand names to lipid building blocks and structural properties. We provide implementations of Goslin in four major programming languages, namely C++, Java, Python 3, and R, to kick-start adoption and integration. Further, we set up a web service for users to work with Goslin directly. All implementations are available free of charge under a permissive open source license.


Web Application and REST API

Interactive Usage
The interactive Grammar of Succinct Lipid Nomenclatures (Goslin) web application is available at https://apps.lifs.isas.de/goslin. It provides two forms to (i) upload a file containing one lipid name per line (see Supplementary Figure S1), or (ii) submit a list of lipid names, defined by the user in an interactive form (see Supplementary Figure S2). The latter form also allows pasting lists of lipid names directly from the clipboard with CTRL+V. Both forms provide feedback on issues concerning every processed lipid, such as invalid names or typos (see Supplementary Figure S3), to allow users to cross-check their data before proceeding. After successful validation, the validated lipids are returned in overview cards (see Supplementary Figure S4), detailing their LipidMAPS classification 1, cross-links to SwissLipids 2 and/or LipidMAPS or HMDB 3. Additionally, the cards show summary information about the number of carbon atoms, double bonds, and hydroxylations, as well as detailed information, such as double bond position, long-chain-base status, and the bond type of the fatty acyl to the head group for each fatty acyl, if available (see Supplementary Figure S5).

Programmatic access via the REST API
An interactive documentation for the representational state transfer (REST) application programming interface (API) of the Goslin web application is available at https://apps.lifs.isas.de/goslin/swagger-ui.html (see Supplementary Figure S6). To illustrate its usage, we briefly show a small example of how a user can access the REST API with a standard hypertext transfer protocol (HTTP) client.
Figure S6: The Goslin web application provides an interactive documentation for its REST API to simplify programmatic access.
The structure of the request consists of a JavaScript object notation (JSON) object {} enclosing two lists named lipidNames and grammars. Acceptable values for grammars are: LIPIDMAPS, GOSLIN, GOSLIN_FRAGMENTS, SWISSLIPIDS, and HMDB. A complete list is available from the interactive REST API documentation's Models section under ValidationRequest. Both fields in the ValidationRequest accept comma-separated entries, enclosed in double quotes:
{ "lipidNames": [ "Cer(d18:1/16:1(6Z))" ], "grammars": [ "LIPIDMAPS" ] }
Sending the HTTP POST request with curl as an HTTP client looks as follows:
curl -X POST "https://apps.lifs.isas.de/goslin/rest/validate" -H "accept: */*" -H "Content-Type: application/json" -d "{ \"lipidNames\": [ \"Cer(d18:1/16:1(6Z))\" ], \"grammars\": [ \"LIPIDMAPS\" ]}"
For this request, the REST API responds with an HTTP response code of 200 (OK) and a result that contains a map of properties for each lipid name that was parsed. If at least one name is not parseable, the REST API will return a response code of 400 (Client error), together with the same results response object. In that case, the failedToParse field in the response will contain the number of lipid names that could not be parsed. For those results where no grammar was applicable, the grammar field will contain the string NOT_PARSEABLE. In other cases, that field will contain the last grammar used to parse the lipid name, and the messages field will contain a list of validation messages that help to narrow down the offending parts of the lipid name.
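The same request can also be assembled with a scripting language instead of curl. The following is a minimal sketch in Python using only the standard library. The endpoint URL, the field names lipidNames, grammars, and grammar, and the value NOT_PARSEABLE are taken from the description above. All function names are hypothetical and chosen for illustration only. The helper not_parseable expects a list of per-lipid result objects, because the exact layout of the response object is not reproduced here.

```python
import json
import urllib.error
import urllib.request

# Endpoint quoted in the text above.
VALIDATE_URL = "https://apps.lifs.isas.de/goslin/rest/validate"


def build_validation_body(lipid_names, grammars):
    """Serialize a ValidationRequest with its two lists as JSON text."""
    return json.dumps({"lipidNames": list(lipid_names), "grammars": list(grammars)})


def build_http_request(lipid_names, grammars, url=VALIDATE_URL):
    """Create (but do not send) the POST request that the curl call issues."""
    body = build_validation_body(lipid_names, grammars).encode("utf-8")
    headers = {"accept": "*/*", "Content-Type": "application/json"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def validate(lipid_names, grammars, url=VALIDATE_URL):
    """Send the request and return (HTTP status, decoded JSON response)."""
    request = build_http_request(lipid_names, grammars, url)
    try:
        with urllib.request.urlopen(request) as response:
            return response.status, json.loads(response.read().decode("utf-8"))
    except urllib.error.HTTPError as error:
        # Any HTTP error status ends up here. Per the text, a 400 response
        # carries the same results object, so its body is decoded as well.
        return error.code, json.loads(error.read().decode("utf-8"))


def not_parseable(results):
    """Keep the per-lipid result dicts whose grammar field is NOT_PARSEABLE."""
    return [r for r in results if r.get("grammar") == "NOT_PARSEABLE"]
```

Calling validate(["Cer(d18:1/16:1(6Z))"], ["LIPIDMAPS"]) requires network access to the web service. The two build_ functions can be used offline to inspect the request before it is sent.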

C++ Implementation
This is the documentation for the Goslin reference implementation for C++. Please be aware that this documentation is intended for developers of tools for computational lipidomics who want to use cppgoslin within their projects. If you are interested in running Goslin as a user, please read Supplementary Section 1. The cppgoslin implementation has been developed with the following objectives:
1. To ease the handling of lipid names for developers working on mass spectrometry-based lipidomics tools.
2. To offer a tool that unifies all existing dialects of lipid names.
It is an open-source package under the MIT License available via GitHub 1. For a detailed structure of the implementation, read Supplementary Section 6.

Prerequisites
The cppgoslin library requires a GNU g++ compiler with support for the C++11 standard. It comes with simple makefiles for easy compilation and installation. You need the following packages:
To install the library globally on your system, simply type:
Be sure that you have root permissions. By default, the library and headers are installed into the /usr directory. If you want to change that location, you have to edit the first line of the makefile.

Testing cppgoslin
We set up more than 150 000 individual unit and integration tests to ensure that cppgoslin parses correctly. To run the tests, please type:
To handle unexpected behavior, the parsing command should always be placed within a try/catch block, and the LipidAdduct pointer should be deleted after usage to avoid memory leaks. Be aware that when changing the installation directory, you also have to change the library directory within the examples makefile.
To retrieve a parsed lipid name at a higher level of the lipid hierarchy, simply define the level when requesting the lipid name:

Requesting a lipid name on a lower level than the one provided will throw an exception. This functionality especially enables an easy way of computing data for histograms on lipid class or category level.

Python Implementation
This is the documentation for the Goslin reference implementation for Python 3. Please be aware that this documentation is intended for developers of tools for computational lipidomics who want to integrate pygoslin into their projects. If you are interested in running Goslin as a user, please read Supplementary Section 1. The pygoslin implementation has been developed with the following objectives:
1. To ease the handling of lipid names for developers working on mass spectrometry-based lipidomics tools.
2. To offer a tool that unifies all existing dialects of lipid names.
It is an open-source package under the MIT License available via GitHub 3. For a detailed structure of the implementation, read Supplementary Section 6.

Prerequisites
The pygoslin package uses Python's package management system pip to create an isolated and defined build environment. You need Python >=3.5 and the following packages to build the pygoslin package:
python3-pip
cython (module for Python 3)
make (optional)
To install the package globally in your Python distribution, simply type:
Be sure that you have root permissions.

Testing pygoslin
We set up more than 150 000 individual unit and integration tests to ensure that pygoslin parses correctly. To run the tests, please type:
Retrieving a parsed lipid name at a higher level of the lipid hierarchy especially enables an easy way of computing data for histograms on lipid class or category level. Requesting a lipid name on a lower level than the one provided will raise an exception.
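The level behaviour described for the implementations can be illustrated with a small standalone model. The sketch below does not use pygoslin and is not its API: the names Level, ToyLipid, and name_at are hypothetical, the six levels listed are an assumption based on the shorthand hierarchy of Liebisch et al., and the names stored for each level are entered by hand rather than parsed. It only shows the two rules stated in the text: a name may be requested at the parsed level or any higher one, and a request for a lower level raises an exception.

```python
from collections import Counter
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical hierarchy levels, ordered from coarse to fine."""
    CATEGORY = 1
    CLASS = 2
    SPECIES = 3
    MOLECULAR_SUBSPECIES = 4
    STRUCTURAL_SUBSPECIES = 5
    ISOMERIC_SUBSPECIES = 6


class ToyLipid:
    """Stand-in for a parsed lipid: one name per level up to the parsed level."""

    def __init__(self, names):
        # names maps Level -> name; the finest level present is the parsed level.
        self.names = dict(names)
        self.level = max(self.names)

    def name_at(self, level=None):
        """Return the name at the given level (default: the parsed level)."""
        level = self.level if level is None else level
        if level > self.level:
            # Finer detail than what was parsed is not available.
            raise ValueError(f"{self.level.name} lipid has no {level.name} name")
        return self.names[level]


# Hand-written example data (illustrative, not parser output).
lipids = [
    ToyLipid({Level.CATEGORY: "GP", Level.CLASS: "PC", Level.SPECIES: "PC 32:1"}),
    ToyLipid({Level.CATEGORY: "GP", Level.CLASS: "PC", Level.SPECIES: "PC 34:1",
              Level.MOLECULAR_SUBSPECIES: "PC 16:0_18:1"}),
    ToyLipid({Level.CATEGORY: "GL", Level.CLASS: "TAG", Level.SPECIES: "TAG 52:2",
              Level.MOLECULAR_SUBSPECIES: "TAG 18:1_18:0_16:1"}),
]

# Histogram data on class level: count lipids per class name.
class_histogram = Counter(lipid.name_at(Level.CLASS) for lipid in lipids)
```

Because every lipid can be reported at class or category level regardless of how detailed the original name was, the counting step does not need to distinguish between input levels.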

R Implementation
This project is a parser, validator and normalizer implementation for shorthand lipid nomenclatures, based on the Grammar of Succinct Lipid Nomenclatures (Goslin) project, for the R language 4.
Goslin defines multiple grammars compatible with ANTLRv4 for different sources of shorthand lipid nomenclature. This allows parsers to be generated from the defined grammars, which provide immediate feedback on whether a processed lipid shorthand notation string is compliant with a particular grammar or not.
rgoslin uses the Goslin grammars and the cppgoslin parser to support the following general tasks:
1. Facilitate the parsing of shorthand lipid name dialects.
2. Provide a structural representation of the shorthand lipid after parsing.
3. Use the structural representation to generate normalized names.
rgoslin is an open-source package available via GitHub 5.

Prerequisites
This project uses the R programming language. To be able to use it, please install R 6 following the instructions for your particular operating system. rgoslin is based on native C++ code (via cppgoslin). It therefore requires additional tools on your system to compile and install it. Please see the Rcpp FAQ 7 , question 1.3 for installation details for your specific operating system.
Install the 'devtools' package with the following command.
This will install the latest, potentially unstable development version of the package with all required dependencies into your local R installation.

Using rgoslin
To load the package, start an R session and type
library(rgoslin)
Type the following to see the package vignette / tutorial:

vignette('introduction', package = 'rgoslin')
In order to use the provided translation functions of rgoslin, you first need to load the library.

library(rgoslin)
To check whether a given lipid name can be parsed by any of the parsers supplied by cppgoslin, you can use the isValidLipidName function. It will return TRUE if the given name can be parsed by any of the available parsers and FALSE if the name was not parseable.
isValidLipidName("PC 32:1")
Using parseLipidName with a lipid name returns a named vector of properties of the parsed lipid name.
If you want to parse multiple lipid names, use the parseLipidNames function with a vector of lipid names. This returns a data frame of properties of the parsed lipid names with one row per lipid. Finally, if you want to parse multiple lipid names and want to use one particular grammar:
originalNames <- c("PC 32:1", "LPC 34:1", "TAG 18:1_18:0_16:1")
multipleLipidNamesWithGrammar <- parseLipidNamesWithGrammar(originalNames, "Goslin")

Java Implementation
This project is a parser, validator and normalizer implementation for shorthand lipid nomenclatures, based on Goslin, for the Java programming language 8.
Goslin defines multiple grammars compatible with ANTLRv4 for different sources of shorthand lipid nomenclature. This allows parsers to be generated from the defined grammars, which provide immediate feedback on whether a processed lipid shorthand notation string is compliant with a particular grammar or not.
jgoslin uses the Goslin grammars and the generated parsers to support the following general tasks:
1. Facilitate the parsing of shorthand lipid name dialects.
2. Provide a structural representation of the shorthand lipid after parsing.
3. Use the structural representation to generate normalized names.
jgoslin is an open-source package available via GitHub 9.

Prerequisites
This project is based on Java 11. To use it, you need a Java Runtime Environment (JRE) installed on your system. If you want to use the library in your own Java projects, you need a Java Development Kit (JDK) installed on your system. Please consult https://adoptopenjdk.net/installation.html for installation options and instructions for your operating system.

Installation instructions
Building the project and generating client code from the command-line
In order to build the client code and run the unit tests, execute the following command from a terminal:
./mvnw install
or on Windows:

mvnw.bat install
This compiles and tests the Java library.

Testing jgoslin
jgoslin comes with a comprehensive collection of unit (JUnit 5), integration (JUnit 5) and acceptance (Cucumber) tests. You can run all of them as follows:
./mvnw verify

Using the command-line interface
The cli sub-project provides a command line interface (CLI) for parsing of lipid names either from the command line or from a file with one lipid name per line.
After building the project as mentioned above with ./mvnw install, the cli/target folder will contain the jgoslin-cli-<VERSION>-bin.zip file. Alternatively, you can download the latest CLI zip file from Bintray (https://bintray.com/lifs/maven/jgoslin-cli): search for the latest jgoslin-cli-<VERSION>-bin.zip artefact and click to download.
In order to run the validator, unzip that file, change into the unzipped folder and run java -jar jgoslin-cli-<VERSION>.jar to see the available options.
To parse a single lipid name from the command line using all available parsers, run
java -jar jgoslin-cli-<VERSION>.jar -n "Cer(d18:1/20:2)"
The output will tell you what is done and will echo a ...
To use a specific grammar, instead of trying all, run
java -jar jgoslin-cli-<VERSION>.jar -f lipidNames.txt -g GOSLIN
To write output to the tab-separated output file 'goslin-out.tsv' instead of to the terminal, run
java -jar jgoslin-cli-<VERSION>.jar -f lipidNames.txt -g GOSLIN -o
If you want to use all available grammars, simply omit the -g GOSLIN argument.
Please note that you will then receive N times M lines in the output file, where N is the number of lipid names and M the number of grammars.

Using jgoslin
To integrate jgoslin in your own projects as a library, please see the README file at https://github.com/lifs-tools/jgoslin for more details.
For an overview of the domain model used by jgoslin, please see Supplementary Section 6.
Figure S7. The classes LipidCategory, LipidLevel, LipidClass, and LipidFaBondType are predefined enumerations. Here, LipidClass is generated automatically from a list containing lipid information (name, description, category, abbreviation, synonyms) for all implementations; see Supplementary Table S1 for details. This especially eases maintenance and ensures that the Goslin implementations share the same data basis. The main class unifying all classes and provided by the parsers is LipidAdduct. It contains information about the pure lipid, the adduct, and the fragment (if defined). The different lipid classes inherit from each other in a hierarchical fashion as defined by Liebisch et al. 4. A dictionary in the class LipidSpecies stores all its associated fatty acyl chains, which are defined by the class FattyAcid. The class LipidSpeciesInfo stores the cumulated species-level information on carbon length, double bonds, etc.
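The relationship between these classes can be made concrete with a strongly simplified sketch in Python. The class names LipidAdduct, LipidSpecies, FattyAcid, and LipidSpeciesInfo are taken from the description above. All field and method names, the flat (non-inheriting) layout, and the chain keys FA1 to FA3 are assumptions made for illustration and do not reproduce the actual implementations. The sketch shows how species-level totals can be accumulated from the fatty acyl chains held in the dictionary.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class FattyAcid:
    """One fatty acyl chain (field names are illustrative assumptions)."""
    name: str
    num_carbon: int
    num_double_bonds: int = 0


@dataclass
class LipidSpeciesInfo:
    """Cumulated species-level totals over all chains."""
    num_carbon: int = 0
    num_double_bonds: int = 0


@dataclass
class LipidSpecies:
    """Head group plus a dictionary of its fatty acyl chains."""
    head_group: str
    fa: Dict[str, FattyAcid] = field(default_factory=dict)

    def add(self, chain: FattyAcid) -> None:
        # Chains are stored by name, mirroring the dictionary in the text.
        self.fa[chain.name] = chain

    def info(self) -> LipidSpeciesInfo:
        # Sum carbon atoms and double bonds over all stored chains.
        total = LipidSpeciesInfo()
        for chain in self.fa.values():
            total.num_carbon += chain.num_carbon
            total.num_double_bonds += chain.num_double_bonds
        return total

    def species_name(self) -> str:
        info = self.info()
        return f"{self.head_group} {info.num_carbon}:{info.num_double_bonds}"


@dataclass
class LipidAdduct:
    """Top-level object: the lipid, plus adduct and fragment if defined."""
    lipid: LipidSpecies
    adduct: Optional[str] = None
    fragment: Optional[str] = None


# The chains of "TAG 18:1_18:0_16:1", entered by hand.
tag = LipidSpecies("TAG")
tag.add(FattyAcid("FA1", 18, 1))
tag.add(FattyAcid("FA2", 18, 0))
tag.add(FattyAcid("FA3", 16, 1))
parsed = LipidAdduct(lipid=tag)
```

In this model, the three chains sum to 52 carbon atoms and 2 double bonds, so the species-level name is TAG 52:2.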