Automating and Extending Comprehensive Two-Dimensional Gas Chromatography Data Processing by Interfacing Open-Source and Commercial Software

Comprehensive two-dimensional gas chromatography (GC×GC) is a powerful analytical tool for both nontargeted and targeted analyses. However, there is a need for more integrated workflows for processing and managing the resultant high-complexity datasets. End-to-end workflows for processing GC×GC data are challenging and often require multiple tools or software to process a single dataset. We describe a new approach, which uses an existing underutilized interface within commercial software to integrate free and open-source/external scripts and tools, tailoring the workflow to the needs of the individual researcher within a single software environment. To demonstrate the concept, the interface was successfully used to complete a first-pass alignment on a large-scale GC×GC metabolomics dataset. The analysis was performed by interfacing bespoke and published external algorithms within a commercial software environment to automatically correct the variation in retention times captured by a routine reference standard. Variation in 1tR and 2tR was reduced on average from 8 and 16% CV prealignment to less than 1 and 2% post alignment, respectively. The interface enables automation and creation of new functions and increases the interconnectivity between chemometric tools, providing a window for integrating data-processing software with larger informatics-based data management platforms.

Alongside supporting tables and figures cross-referenced in the main text, instructions for using the example folders and for creating command and batch files are given in the Supporting Information. The example folders contain files equivalent to those used in the exemplar methods and have been made available via a repository at https://github.com/rcfgroup/gc-automation.

SUPPORTING INFORMATION TABLE OF CONTENTS
TABLES
Table S1. S-2
Figure S1. S-5

EXAMPLE FOLDER INSTRUCTIONS
Before you begin S-6
Explanation of Example Folders S-9
Instructions S-10
  1 Match Template folder
  2 Export Match File folder
  3 Apply Match File folder
Results S-11
  1 Match Template folder
  2 Export Match File folder
  3 Apply Match File folder

Locating the command line interface S-14
Creating a command file S-14
Creating a batch file S-16
Integrating external scripts S-17
Accessing the command line in other software S-18

All have advanced full-featured GUIs. All capable of common pre-processing steps based on proven methods including baseline correction, deconvolution, alignment (peak merging/matching) as well as visualisation, and comparative and statistical analysis.

Published
Function | Name | Platform | Description | Ref.
Peak deconvolution and detection | PARAFAC and PARAFAC2 | MATLAB | PARAFAC2 peak detection robust to shifts in retention time by relaxing trilinearity rule. | 1,2
Feature extraction / Data reduction | INCA | R | Normalisation and then alignment based on criteria for refining peak list. Used peak list from ChromaTOF. | 3
Feature extraction / Data reduction | Tile-based Fisher Ratio | MATLAB | In-house tile-based F-ratio software. Supervised sample classification based on the Fisher ratios of binned regions of chromatogram described as 'tiles'. | 4
Peak detection / Feature extraction | T-SEN | R | Tool for targeted analysis and screening. Worked well on GC×GC coupled with high resolution mass spectrometry data. | 5
Feature extraction / Data reduction | Discriminant pixel | MATLAB and C | Alignment based on dynamic time warping. Similar reduction method to Tile-based Fisher Ratio approach, using ANOVA and correlation significance testing on a pixel level. | 6
Classification | NMF | R | Uses NMF R package with in-house R program for masking, resolution control and normalization. Performs non-negative matrix factorization for sample classification. Preprocessing included using local alignment plug-in in GC Image. | 7
Peak deconvolution and detection | MCR-ALS | MATLAB | Multivariate curve resolution-alternating least squares analysis for handling main data challenges including baseline correction, retention time shifts, and deconvolution. | 8
Feature extraction / Data reduction | NPLS | MATLAB | Uses 2D-asymmetric least squares algorithm for baseline correction prior to N-way partial least squares algorithm for data reduction. | 9
Peak detection | DotMap | MATLAB | Pixel-based method based on mass spectral matching, used for targeted analysis and screening. Performed PARAFAC prior to analysis. | 10
Peak detection | msPeak and msPeakG | R | Finds peak regions based on Normal-Exponential-Bernoulli and in later version by Normal-Gamma-Bernoulli models and then detects peaks within region based on probability models. | 11
Workflow / metabolomics pipeline | Guineu | Java | Java-based. Modular design for user development. Alignment, normalisation, artefact removal, retention indices and functional group identification. First produce peak list from ChromaTOF. | 26
Workflow / metabolomics pipeline | RMet | R | Import CDF files, supports visualisation, peak detection using MCR-ALS model and multivariate analysis. | 27
Workflow / metabolomics pipeline | RGCxGC | R | Import netCDF files. Signal pre-processing including baseline correction and smoothing, alignment using 2D-COW algorithm and multivariate analysis. | 28

Figure S1. Example of a multi-stage process (e.g. an alignment based on batch number) automated by interfacing commercial and open-source software.

EXAMPLE FOLDER INSTRUCTIONS
To help users with the basics of running their workflow steps through the command line interface, we recommend either running the example scripts in the example folders (at https://github.com/rcfgroup/gc-automation) or first converting a current data processing method (.method) into a command file. Further instructions for running the example scripts are provided on pages S-6 to S-13 of this Supporting Information, in the files and in the GitHub repository. A current method can be converted through the software GUI; instructions for doing so, together with an introduction to the command and batch file format, are provided on pages S-14 to S-17 of this Supporting Information. These instructions and examples allow the user to become familiar with the format and check for errors before moving on to more complex custom tasks. A summary step-by-step list is provided below, with further details of each step provided in the 'Before you begin' section.
Step-by-step checklist

Before you begin
For your own custom workflow, the commercial and free and open-source software (FOSS) would already be installed. For this specific example, however, the following steps should be followed to ensure the example folders work as described.
Installing commercial software
Ensure a copy of the commercial software (in this instance GC Image) is installed (the examples were generated in version 2.8r2). A free trial of the software is available on request from https://gcimage.com/gcxgc/trial.html.

Installing plug-ins
The examples describe interfacing a published alignment algorithm. 15 This algorithm is freely available as a plug-in for version 2.6 or later at http://gcimage.com/forum/viewtopic.php?f=5&t=104 (on the website under Plug-ins), and must be installed for the second and third example folders to work. To access the plug-in, log in to your (free) user account. The original MATLAB tool is available at https://github.com/jsarey/GCxGCalignment. Once the plug-in file has been downloaded and extracted to the GC Image program folder, open GC Image, go to 'Tools' in the menu bar and select 'Manage Plugins' from the list. Click the 'Import' button, then locate and import the plug-in file. The 'Natural Neighbour (NIES-EPFL)' plug-in should now appear in the list of imported plug-ins. Click 'Configure' and change the parameters to match those below. Click 'OK' and close the program.
Installing FOSS
Ensure a compatible version of the free or open-source programming software (in this instance Python) is installed along with the packages used in the script (in this instance the 'click' and 'pandas' packages). Python is freely available at https://www.python.org/downloads/. If using a computer running Windows 10 or later, you can download the Python app from the Microsoft Store.

To check whether Python has been installed, in Windows press the Windows key + R and type 'cmd' (without the quotation marks) in the Run window. This opens the command prompt window. Type 'python' and press Enter. If Python has been installed, the program details will be displayed as shown in Figure S2.
Figure S2: Checking the Python installation in the command prompt window.
To install the necessary packages, first check whether the package management system 'PIP' is installed. To do this in Windows, press the Windows key + R and type 'cmd' in the Run window. This opens a command prompt window. Next, type 'pip -V'. If this returns an error, PIP has not been installed. To install PIP, go to https://bootstrap.pypa.io/ and download the 'get-pip.py' file. Save the file to a known location. In the command prompt window, type 'cd' followed by the location of the get-pip.py file (Figure S3). Next, type 'python get-pip.py' into the command prompt and PIP will be installed (Figure S3). PIP can now be used to install the necessary packages by typing 'pip install click' and then 'pip install pandas'; each command installs the corresponding package. Lastly, check that the packages can be imported: enter 'python', which shows the Python program details to confirm that you are now in the Python interpreter (Figure S4), and then type 'import click' followed by 'import pandas'. You can now close the command prompt window.
Figure S4: Importing packages for the external scripts used in the folders.
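As an alternative to typing the import statements by hand, the same check can be scripted. The following is a minimal sketch, not part of the published example folders: the function name missing_packages is hypothetical, and the script only reports whether the two packages named above can be located.

```python
import importlib.util
import sys

# Packages named in the installation instructions above.
REQUIRED_PACKAGES = ("click", "pandas")


def missing_packages(names):
    """Return the names that the Python import system cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("Python version:", sys.version.split()[0])
    missing = missing_packages(REQUIRED_PACKAGES)
    if missing:
        print("Missing packages:", ", ".join(missing))
        print("Install each one with: pip install <package name>")
    else:
        print("All required packages are installed.")
```

Saving this as a .py file and running it with 'python' from the command prompt lists any package that still needs to be installed with PIP.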

Viewing files
For those inexperienced with basic coding we recommend installing Notepad++ to view the command and batch files. This software is freely available at https://notepad-plus-plus.org/downloads/.
Extract the zip file before use and ensure the 'Example folders' folder is extracted (or copied once extracted) to the 'C:\temp' location. If the 'temp' folder cannot be located, create it.
The 'Example folders' folder (now at 'C:\temp\Example folders') contains three folders: 1 Match Template, 2 Export Match File and 3 Apply Match File. These three folders are examples to help a beginner become familiar with the concept of using the command line interface to integrate free and open-source software with commercial software for GC×GC data processing. The examples in the folders would not necessarily be used independently as part of a workflow, but they effectively demonstrate this new way of processing GC×GC data.
Please note, if using a different version of the commercial software (not v2.8r2), you may need to change the path for the Command Line. In the first and second example folders, this can be changed in the batch (.bat) file (Figure S5A). In the third example folder, this can be changed in the Python (.py) file (line 41, keeping the double slash formatting). The command line interface can be found in the program directory. An example path is: C:\GC Image\GC Image 2.8r2 GCxGC (64-bit)\bin\CommandLine.bat.
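The doubled separators matter because Python treats a single backslash inside an ordinary string as the start of an escape sequence. The sketch below assumes that this is the formatting used in the .py file; it shows two equivalent ways of writing the example path above, and the variable names are illustrative only.

```python
from pathlib import PureWindowsPath

# Example path from the instructions; change the version folder to match your installation.
escaped = "C:\\GC Image\\GC Image 2.8r2 GCxGC (64-bit)\\bin\\CommandLine.bat"

# A raw string (r"...") avoids the need to double each backslash.
raw = r"C:\GC Image\GC Image 2.8r2 GCxGC (64-bit)\bin\CommandLine.bat"

# Both spellings produce exactly the same path text.
assert escaped == raw

cli = PureWindowsPath(raw)
print(cli.name)    # file name of the command line interface
print(cli.parent)  # folder that contains it
```

Either spelling can be used when editing the path, provided the same style is kept throughout the line being changed.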

Right-click on the .bat file in the first or second example folder, or the .py file for the third example folder, and select 'Edit with Notepad++'. In the .bat file, under 'set GC_IMG=' paste the new path and then save the .bat file (as in Figure S5A). In the .py file, on line 41 after 'cmd =', enter the new path using the double slash format shown in Figure S5B. When running the batch files (as described on page S-10) using a different version of the commercial software, the following message may appear.
Click 'OK' and the analysis will continue. The message will appear three times for each file being processed; click 'OK' each time to continue. This is not an issue for completing the examples; however, for custom workflows with more files, the version information should be changed in the command and batch files.

Explanation of example folders
The first folder (1 Match Template) is a simple exercise, telling the commercial software to perform two tasks using the command line interface. These steps can also be performed in the GUI. This example is to get the user familiar with the translation between the GUI and command and batch files.
The second folder (2 Export Match File) again tells the commercial software to perform two simple tasks (matching a template and exporting a summary report), but this time the batch file incorporates external FOSS in the form of a Python script. Once the commercial software completes the tasks, the Python script uses the outputs (template file and summary report) to automatically generate an exported match file. The generation of multiple match files is a task that cannot be batch processed within the GUI.
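The pairing step performed by the Python script can be illustrated with a minimal sketch. This is not the script from the repository: the column names (compound, rt1, rt2), the output layout and the retention values are all hypothetical, and the real template, summary report and match file formats used by the commercial software are not reproduced here.

```python
import csv
import io


def build_match_rows(template_rows, report_rows):
    """Pair each template peak with the same-named peak found in the summary report."""
    observed = {row["compound"]: row for row in report_rows}
    matched = []
    for ref in template_rows:
        obs = observed.get(ref["compound"])
        if obs is None:
            continue  # peak not found in this chromatogram, so it is skipped
        matched.append({
            "compound": ref["compound"],
            "ref_rt1": ref["rt1"],
            "ref_rt2": ref["rt2"],
            "obs_rt1": obs["rt1"],
            "obs_rt2": obs["rt2"],
        })
    return matched


def rows_to_csv(rows):
    """Serialise the paired positions as CSV text."""
    fieldnames = ["compound", "ref_rt1", "ref_rt2", "obs_rt1", "obs_rt2"]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up reference (template) and observed (summary report) positions.
template = [
    {"compound": "C10", "rt1": "12.0", "rt2": "1.50"},
    {"compound": "toluene", "rt1": "8.5", "rt2": "2.10"},
    {"compound": "C12", "rt1": "18.0", "rt2": "1.60"},
]
report = [
    {"compound": "toluene", "rt1": "8.6", "rt2": "2.25"},
    {"compound": "C10", "rt1": "12.1", "rt2": "1.55"},
]

match_rows = build_match_rows(template, report)
print(rows_to_csv(match_rows))
```

Keying the observed peaks by compound name keeps the pairing independent of the order in which peaks appear in the report, which is what allows the same function to be applied to every chromatogram in a batch.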
The third folder (3 Apply Match File) uses the processed chromatograms (Output folder) from the first folder (1 Match Template) and the exported match files (Exported Match Files folder) from the second folder to perform a more complex list of commands, aligning the chromatograms using the Gros et al. local alignment algorithm (available as a plug-in). 15 This time the commercial software is run through the Python script (as opposed to the Python script being appended to the end). The Python script performs an automatic match between the batch number in the exported match files and the batch number in the chromatogram filename. Once it has matched them, it uses the external algorithm to automatically align each chromatogram using each batch's unique match file. This intelligent iterative matching cannot be performed within the GUI.
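The batch-number matching described above can be sketched as follows. The file names and the batch-number pattern (a six-digit date followed by a run number) are assumptions based on the batch numbers quoted in this document; the function names are hypothetical and this is not the script from the repository.

```python
import re

# Assumed pattern: six-digit date, a space or underscore, then a run number (e.g. "180808 15").
BATCH_PATTERN = re.compile(r"(\d{6})[ _](\d+)")


def batch_id(filename):
    """Extract a normalised batch number from a file name, or None if absent."""
    found = BATCH_PATTERN.search(filename)
    return f"{found.group(1)} {found.group(2)}" if found else None


def pair_by_batch(chromatograms, match_files):
    """Map each chromatogram to the match file that carries the same batch number."""
    by_batch = {batch_id(name): name for name in match_files}
    pairs = {}
    for chromatogram in chromatograms:
        key = batch_id(chromatogram)
        if key is not None and key in by_batch:
            pairs[chromatogram] = by_batch[key]
    return pairs


# Hypothetical file names built from the batch numbers used in the examples.
chromatograms = ["ref mix 180808 15.gci", "ref mix 180815 1.gci", "ref mix 180815 9.gci"]
match_files = ["match 180815 9.csv", "match 180808 15.csv", "match 180815 1.csv"]

for chromatogram, match_file in pair_by_batch(chromatograms, match_files).items():
    print(chromatogram, "->", match_file)
```

Each resulting pair would then be passed to the alignment step, so that every chromatogram is aligned with the match file from its own batch.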
The example data in the first and third folders are chromatograms of a simple reference mixture with small variation in secondary retention times between the samples, especially for compounds with higher 2tR (e.g. aromatics). The files have three different batch numbers (180808 15, 180815 1, and 180815 9) representing samples run at different times. The example data in the second folder are chromatograms of an n-alkane and aromatics mixture run on the corresponding dates as the reference mixture, with the same batch numbers. To save memory space and make the demonstration as simple as possible, the chromatograms are GC×GC-FID files, which have been exported as a model image and saved as .gci files.

1 Match Template folder - Instructions
In the 'Input' folder there are three chromatograms (six files in total, three .gci and three .bin files) labelled 'ref mix' with three different batch numbers (180808 15, 180815 1, and 180815 9). To run the analysis, simply double-click the .bat file. The command line window will open showing each file being processed. The command window will then close and three processed chromatograms (six files in total, three .gci and three .bin files) will now have appeared in the Output folder. The number of chromatograms can be changed (e.g. >100s), and the processing time will change accordingly; however, this example highlights the simplicity of sharing and repeating a workflow and gets the user familiar with the interface (see Results on page S10).

2 Export Match File folder - Instructions
In the 'Input' folder there are three chromatograms (six files in total, three .gci and three .bin files). To run the analysis, simply double-click the .bat file. The command line window will open showing each file being processed. The command window will then close and three processed chromatograms (six files in total, three .gci and three .bin files) will now have appeared in the Output folder. In the 'Exported Match Files' folder three .csv files will now have appeared. These are the exported match files for each batch, based on the change in retention positions of the n-alkanes and aromatic compounds in the mixture (see Results on page S10).

3 Apply Match File folder - Instructions
Firstly, copy the files of the three chromatograms of the reference mixture (.gci and .bin files) from the Output folder of the first example folder (1 Match Template), and paste them into the Input folder in this folder. Next, copy the three exported match files (.csv files) from the Exported Match Files folder of the second example folder (2 Export Match Files), and paste them into the Exported Match Files folder in this folder.
To run the analysis, simply double-click the .bat file. The command line window will open showing each file being processed (Figure S6). The command window will then show a message saying 'Process finished' and 'Press any button to continue'. Press the Enter key on the keyboard and the command line window will close. Three processed chromatograms (six files in total, three .gci and three .bin files) will now have appeared in the Output folder. In the 'Exported CSV Files' folder three .csv files will now have appeared. These are the chromatograms, each exported as a single-column vector .csv file. The processed chromatograms can be reviewed in Investigator (see Results section on pages S12-S13).

1 Match Template folder - Results
The Input and Output files can be viewed in GC Image. Open GC Image and click the yellow folder icon in the top left corner (Figure S7). Navigate to the C:\temp\Example folders\1 Match Template\Input folder and select a chromatogram to open. You will see a series of peaks; however, the chromatograms from the Input folder will appear unprocessed, without any peaks (or 'blobs') detected. In the chromatograms from the Output folder, the series of peaks will have been detected (with yellow and red outlines around them), and the peaks above the hydrocarbon series with yellow outlines (the carbonyl, terpene and aromatic peaks) will have been matched to the template and have compound names associated with them.

2 Export Match File folder - Results
The chromatograms in the Input folder show a mixture of n-alkane and aromatic compounds separated across the 2D chromatographic space. The peaks were detected and a template applied, labelling the peaks across all the samples. This can be done in the commercial software GUI or using the interface as demonstrated in the first example. The chromatograms in the Output folder are the same, except that the alignment template, comprising a subset of the compounds, was applied to the peaks and a summary report of the peak positions was generated as specified in the command file. The Python script in the folder took the reference positions of the subset of peaks from the template file and the actual retention positions of the same peaks in the chromatogram from the summary report, and combined them to reproduce match files (as can be produced one at a time in the GUI) for each of the chromatograms automatically. This example demonstrates how interfacing commercial software and free and open-source software can be used to automate custom iterative tasks.

3 Apply Match File folder - Results
The Input and Output files, i.e. the chromatograms pre- and post-alignment using the Gros et al. local alignment algorithm, can be compared in Investigator (part of the GC Image suite).
To do this, firstly open Investigator. In the top left of Investigator go to 'File' and click 'Load Images'. Navigate to the C:\temp\Example folders\3 Apply Match Files\Input folder, select all three chromatograms holding the shift key and click 'Open'.
In the 'Load Options' window (see Figure S8), uncheck the 'Use configuration' box and in the 'Features' tab select the 'As is' radio button. Next, click on the 'Attributes' tab, select 'Analyse specific blob/area attributes' and select the 'Choose retention and response attributes' radio button. In the list, check only the 'Retention I', 'Retention II' and 'Volume' boxes, and uncheck the 'Analyse specific blob set attributes' box (Figure S9). In the 'Class Assignment' tab, select all three files by clicking whilst holding the shift key, click 'Assign Class Label' and choose 'New Label'. In the box enter 'Pre' or an equivalent label and then click 'Ok', and then click 'Ok' again in the bottom left-hand corner of the window. Investigator will then load the chromatograms into the program.
Figure S8: Loading the processed chromatograms in Investigator to compare the secondary retention times pre- and post-alignment.
Next, repeat the above to load the processed chromatograms, this time navigating to the C:\temp\Example folders\3 Apply Match Files\Output folder, selecting all three chromatograms while holding the shift key and clicking 'Open'. All the options in the 'Load Options' window will now be greyed out, except for the 'Class Assignment' tab; highlight the new files and label them 'Post' or an equivalent label.
All six chromatograms should now be loaded into Investigator. If available in your version of the software, click 'Analysis' in the menu bar at the top and select 'Custom settings'. In the 'Custom Analysis Settings' window, only check the 'Class Mean', 'Class Stdev', 'Class %RSD' and 'Pairwise Mean Difference' boxes and click 'Ok'. If 'Analysis' is not available, go to the next step.
Next, click the 'Attributes' tab (see Figure S9) and in the left-hand column select 'Retention II'. In the table on the right-hand side it is now possible to compare the %RSD(Pre) and %RSD(Post) columns. If 'Analysis' was not available to define the %RSD columns, the comparison can still be made by scrolling across to see the RSD columns labelled 'Pre' and 'Post'. For all the compounds, the variation in secondary retention time is reduced in the post-alignment chromatograms (Figure S9). This analysis was implemented by double-clicking a .bat file. For further information on using Investigator, please refer to the manufacturer guidance. In this example, the reference mixture of terpenoids, aromatics and carbonyls stands in for samples; for demonstration purposes, using a reference mixture of known compounds (independent of the alignment mixture) helps demonstrate the workflow implemented through the interface.
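For readers who wish to reproduce the %RSD comparison outside Investigator, the calculation can be sketched in a few lines. The retention times below are invented for illustration, and the use of the sample standard deviation is an assumption; the definition used by the commercial software should be checked in the manufacturer guidance.

```python
import statistics


def percent_rsd(values):
    """Relative standard deviation in percent: 100 * sample standard deviation / mean."""
    return 100.0 * statistics.stdev(values) / statistics.mean(values)


# Hypothetical secondary retention times (s) for one compound across three batches.
pre_alignment = [2.10, 2.25, 2.40]
post_alignment = [2.24, 2.25, 2.26]

print(f"%RSD(Pre):  {percent_rsd(pre_alignment):.2f}")
print(f"%RSD(Post): {percent_rsd(post_alignment):.2f}")
```

A lower post-alignment value for a compound indicates that its retention positions agree more closely across batches after alignment.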

FAMILIARISATION AND CREATION OF NEW INTERFACE FILES
The steps for integrating external scripts are described, and examples that include external scripts (with example data) have been made available at https://github.com/rcfgroup/gc-automation. However, as the integration of external scripts is intended to increase flexibility, the final implementation is up to the user.
The steps described below are for use with GC Image v2.8 or later. A free trial of the software is available on request from https://gcimage.com/gcxgc/trial.html. Similar methods can be developed using different software. An alpha version of a Python interface has been developed that can be made compatible with any software and automatically generates the command and batch files; it can be made available on request.

Locating the command line interface
The command line interface can be accessed through the program directory. An example file path is shown below:
C:\GC Image\GC Image 2.8r2 GCxGC (64-bit)\bin\CommandLine.bat
If multiple versions of the software are installed, locate the command line in the version you wish to use.

Creating a command file
The command file is the list of processing steps the user would like to perform on the data files (e.g. baseline correction, peak detection) based on the main functions in the commercial software. The command file can be created by directly exporting the method from the software. The individual processing steps can also be accessed directly from the program directory.
For beginners, we recommend using Notepad or Notepad++, free and user-friendly text editors suitable for basic programming. A new command file can be created as a text (.txt) file and then saved as a .process.cmd file, or the example .cmd files provided in the GitHub repository can be saved and edited in Notepad++.
The command file has the following structure:
-Configure C:\temp\configuration file.cfg
-script
<script>
  <cmd name = "command name" label= "label name">
    <timeStamp> date and time </timeStamp>
    <username> local user </username>
    <cmdversion> software version </cmdversion>
    <do>
      <parameter id = "parameter" label = "parameter label"> file path </parameter>
    </do>
  </cmd>
</script>
## Start with -Configure followed by the file path and filename of the configuration file
## .cfg your method and data processing steps relies on.
## All commands e.g. processing steps, within each set of <cmd> and </cmd> tag
## operators should be kept within the -Script <script> and </script> tag operators and
## can be copied directly from the exported method as described below.
## The copied information includes the <do> </do> tag operators which contains any
## additional parameters needed for that command, with each parameter defined within
## <parameter> </parameter> tag operators.
## The Match Template command requires additional
## parameters such as the template file (.bt), which is given between
## the <parameter> </parameter> tag operators. The user can use any
## number of commands. These parameters mirror the options available
## in the software 'Method Editor' GUI (GC Image -> Method -> Manage
## & Run Methods -> Edit Method -> Apply Template option).
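Because the command file is plain text with a regular structure, it can also be written by a script. The sketch below fills in the structure described above from Python data; it is an illustration only, not the Python interface mentioned earlier. The command name, parameter id, labels, file paths and the placement of line breaks are placeholders or assumptions, and real values should be copied from a method exported from the software.

```python
def build_command_file(config_path, commands, timestamp, username, version):
    """Assemble command file text following the structure described above."""
    lines = [f"-Configure {config_path}", "-script", "<script>"]
    for command in commands:
        lines.append(f'  <cmd name = "{command["name"]}" label= "{command["label"]}">')
        lines.append(f"    <timeStamp> {timestamp} </timeStamp>")
        lines.append(f"    <username> {username} </username>")
        lines.append(f"    <cmdversion> {version} </cmdversion>")
        lines.append("    <do>")
        for param in command.get("parameters", []):
            lines.append(
                f'      <parameter id = "{param["id"]}" label = "{param["label"]}"> '
                f'{param["value"]} </parameter>'
            )
        lines.append("    </do>")
        lines.append("  </cmd>")
    lines.append("</script>")
    return "\n".join(lines) + "\n"


# Placeholder values only; copy the real ones from an exported method.
example_commands = [
    {
        "name": "command name",
        "label": "label name",
        "parameters": [
            {"id": "parameter", "label": "parameter label", "value": r"C:\temp\template.bt"},
        ],
    },
]

text = build_command_file(
    r"C:\temp\configuration file.cfg",
    example_commands,
    timestamp="date and time",
    username="local user",
    version="software version",
)
print(text)
```

Generating the file this way makes it straightforward to produce one command file per batch or per template, while keeping every <cmd> block inside a single <script> element.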

Create a folder, labelled with the name of the workflow or process, and save the command file and batch file within it. The folder needs to contain an input folder, an output folder and any additional files used in the analysis. We recommend using the example folder trees provided in the GitHub repository.
An example of the folder tree for the example above, saved in the C:\temp folder, is shown below. The folder contains an input folder, an output folder, the configuration file, the command file, the batch file and additional files that the method requires, such as a template file.
Then copy all the commands, including the <cmd> </cmd> tag operators and all the information between them, to the command file between the <script> </script> tag operators.
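The workflow folder described above can also be created programmatically, which is convenient when setting up several workflows. The sketch below is a minimal illustration: the folder and file names are hypothetical, the files are created empty, and a temporary directory is used in place of C:\temp so that the example can be run safely on any computer.

```python
import tempfile
from pathlib import Path


def create_workflow_folder(root, name, extra_files=()):
    """Create a workflow folder with Input and Output subfolders and empty placeholder files."""
    workflow = Path(root) / name
    for subfolder in ("Input", "Output"):
        (workflow / subfolder).mkdir(parents=True, exist_ok=True)
    for filename in extra_files:
        (workflow / filename).touch()
    return workflow


# Hypothetical file names; replace them with your own configuration, command,
# batch and template files.
placeholder_files = ["configuration file.cfg", "workflow.process.cmd", "run.bat", "template.bt"]

with tempfile.TemporaryDirectory() as temporary_root:
    folder = create_workflow_folder(temporary_root, "My workflow", placeholder_files)
    created = sorted(item.name for item in folder.iterdir())
    print(created)
```

Replacing the temporary directory with the intended location (for example C:\temp) creates the same layout in place, ready for the real files to be copied in.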