Machine Learning-Guided Optimization of p-Coumaric Acid Production in Yeast

Industrial biotechnology relies on Design–Build–Test–Learn (DBTL) cycles to accelerate the development of the microbial cell factories required for the transition to a biobased economy. To use these cycles effectively, appropriate connections between their phases are crucial. Using p-coumaric acid (pCA) production in Saccharomyces cerevisiae as a case study, we propose one-pot library generation, random screening, targeted sequencing, and machine learning (ML) as the links between DBTL phases. We show that the robustness and flexibility of the ML models strongly enable pathway optimization, and we propose feature importance and Shapley additive explanation (SHAP) values as a guide for expanding the design space of the original libraries. This approach increased pCA production by 68% within two DBTL cycles, leading to a 0.52 g/L titer and a 0.03 g/g yield on glucose.


TABLE OF CONTENTS

Supplementary Tables
Table S1: Promoters, ORF and terminator sequences
Table S2: Integration sites used for strain construction
Table S3: List of strains used in this study

Supplementary Figures
Figure S1. Library design and construction
Figure S2. Promoter-terminator characterization
Figure S3. pCA production of correct strains from the PAL library
Figure S4. Model selection and training strategies
Figure S5. Ranking of BMP, top 5 BMP and worst producers
Figure S6. Genotype of top 10 predicted strains, all learning strategies
Figure S7. Phenotype of top 10 predicted strains, all learning strategies
Figure S8. Validation of ML predictions
Figure S9. Additional promoters for CPR
Figure S10. Feature importance
Figure S11. SHAP values
Figure S12. Effect of training data size on model accuracy
Supplementary References

Figure S1: A. Cassettes used for library transformation. A cassette is a combination of a promoter, an ORF and a terminator. Cassettes containing the promoter-ORF combinations shown in orange could not be obtained. B. Schematic representation of the integration of a gene cluster. Connector sequences a to g represent homology regions for in vivo recombination of the cassettes (C1 to C6) and the selection marker cassette (SM); flank sequences homologous to the genomic integration site are shown as F.

Figure S2: Promoter-terminator characterization by GFP fluorescence measured using fluorescence-activated cell sorting (FACS) (A). Zoom-out of the fluorescence values (B).

Figure S3: Characterization of the correct sequences from the PAL library. Frequency of strains with the same designs (left) and average pCA production of designs with replicates (right).

Figure S4: Model selection and training strategies. Genotype and production data were divided into two datasets, the complete and producer datasets, which differ in the inclusion of data from non-producers. Each dataset was used for hyper-parameter (HP) tuning of four ML models: multiple linear regressor (MLR), support vector regressor (SVR), kernel ridge regressor (KRR) and random forest regressor (RFR). The accuracy of the models with optimal HPs was evaluated on the test sets and is shown in the table. For each dataset, two learning strategies were applied: one-time training, where all the data were used for training, and recurrent training, where 90% of the training data was iteratively used for training.
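The workflow summarized in this caption, cross-validated hyper-parameter tuning of the four regressor types followed by scoring on a held-out test set, can be sketched as follows. This is an illustrative reconstruction in scikit-learn using synthetic stand-in genotype data, not the authors' pipeline; the feature matrix, grids and sizes are assumptions:

```python
# Illustrative sketch of the model selection in Figure S4 (not the authors' code):
# four regressors are tuned by cross-validated grid search and scored on a test set.
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.kernel_ridge import KernelRidge
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Synthetic stand-in for the genotype matrix: binary promoter/gene indicators.
X = rng.integers(0, 2, size=(200, 18)).astype(float)
y = X @ rng.normal(size=18) + rng.normal(scale=0.1, size=200)  # surrogate pCA titers

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Hypothetical HP grids; MLR has no HPs to tune.
models = {
    "MLR": (LinearRegression(), {}),
    "SVR": (SVR(), {"C": [0.1, 1, 10], "gamma": ["scale", "auto"]}),
    "KRR": (KernelRidge(kernel="rbf"), {"alpha": [0.01, 0.1, 1]}),
    "RFR": (RandomForestRegressor(random_state=0), {"n_estimators": [50, 200]}),
}

scores = {}
for name, (est, grid) in models.items():
    search = GridSearchCV(est, grid, cv=5, scoring="r2").fit(X_tr, y_tr)
    scores[name] = search.best_estimator_.score(X_te, y_te)  # R^2 on the test set
print(scores)
```

With a one-hot genotype encoding of the six factors, the same loop applies unchanged; only the construction of `X` would differ.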

Figure S5: Ranking of the top measured producer, the top 5 measured producers and the non-producers based on four different training strategies. Results are given per model used in each strategy: MLR, multiple linear regressor; KRR, kernel ridge regressor; SVR, support vector regressor; RFR, random forest regressor. *Ranking of non-producers excludes design 560, which is predicted to produce by all models independently of the training strategy. The BMP was ranked in the top 0.1% to 15% depending on the model used and regardless of the training strategy. Similarly, the 5-BMPs were always predicted to be, at least, in the top 22% of the library. Measured non-producers were ranked in the bottom 46% or 60% of the library depending on the dataset used for training (complete or producers, respectively). Therefore, including non-producers during training did not change the predictions of top producers, but improved the predictions of non-producers, ensuring correct coverage of the complete library by the ML predictions.

Figure S6: Summary of the top 10 predicted producers by each learning strategy. Designs are ranked based on the frequency (Freq.) with which they are chosen as top 1 (T1), top 5 (T5) or top 10 (T10) by different models or data points included during training. Factor 1 refers to ARO4 except where an E is shown (ENO1), factor 2 refers to AROL except where a 1 is shown (ARO1), factor 3 refers to ARO7 except where a P is shown (PHEA), factor 4 refers to PAL, factor 5 refers to C4H and factor 6 refers to CPR. Predicted designs shared by different learning strategies are linked by lines or highlighted in grey. * indicates designs equal to the best measured producer strain (BMP strain).

Figure S7: Predicted pCA production by the top 10 ranked strains found using four different learning strategies. Production relative to the predicted production of the top measured producer is shown. CO, complete dataset with one-time training; CR, complete dataset with recurrent training; PO, producers dataset with one-time training; PR, producers dataset with recurrent training.

Figure S8: Validation of ML predictions. Comparison of the measured and predicted production of the predicted top producers, relative to the production of the best measured producer (BMP). Each of the left panels represents a different learning strategy. Genotypes of the plotted strains are shown in the right panel, where * indicates strains equal to the BMP. Factor 1 refers to ARO4, factor 2 refers to AROL except when a 1 is shown (ARO1), factor 3 refers to ARO7 except when a P is shown (PHEA), factor 4 refers to PAL, factor 5 to C4H and factor 6 to CPR. Promoter strengths are represented by colour intensity. Strain names are defined based on the ranking they belong to and their position in that ranking; when two strains share the same position they are followed by (1) and (2). CO, complete dataset with one-time training; CR, complete dataset with recurrent training; PO, producers dataset with one-time training; PR, producers dataset with recurrent training.

Figure S9: Effect of substituting the CPR promoter in two different hosts: the best measured producer (BMP) and the top producer in the CR ranking (complete dataset, recurrent training).

Figure S10: Permutation feature importance results obtained using the complete (A) or the producers (B) datasets.
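Permutation feature importance of the kind plotted here can be computed with scikit-learn's `permutation_importance`, which measures how much the model's score drops when one feature is shuffled. A minimal sketch on synthetic data; the factor names reuse the six genes from the captions purely as labels, and the data and model are assumptions:

```python
# Minimal sketch of permutation feature importance (cf. Figure S10); synthetic
# data, with the six pathway genes used only as hypothetical feature labels.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 6))
# Only the first two "factors" drive this surrogate response.
y = 3 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(scale=0.1, size=300)

model = RandomForestRegressor(random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)

factors = ["ARO4", "AROL", "ARO7", "PAL", "C4H", "CPR"]
for name, imp in sorted(zip(factors, result.importances_mean), key=lambda t: -t[1]):
    print(f"{name}: {imp:.3f}")
```

The same ranking logic underlies SHAP values (Figure S11), which additionally attribute each individual prediction to the features.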

Figure S12: Effect of training data size on the prediction accuracy of different ML algorithms (MLR, multiple linear regression; SVR, support vector regression; KRR, kernel ridge regression; RFR, random forest regression) with the complete or producers dataset. Negative R² values obtained for some train-test splits were omitted from the calculation of the mean and standard deviation (these values are obtained when the average of the training data is a better estimator than the trained model).
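An analysis of this kind can be reproduced in spirit with scikit-learn's `learning_curve`, masking negative R² splits before averaging as the caption describes. This sketch uses synthetic data and a single model; the dataset, model and training fractions are assumptions:

```python
# Sketch of a training-data-size analysis (cf. Figure S12): cross-validated R^2
# at increasing training sizes, with negative R^2 splits masked before averaging.
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import learning_curve

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 6))
y = X @ rng.normal(size=6) + rng.normal(scale=0.2, size=200)

sizes, _, test_scores = learning_curve(
    KernelRidge(alpha=0.1), X, y,
    train_sizes=np.linspace(0.2, 1.0, 5),
    cv=5, scoring="r2", shuffle=True, random_state=0,
)

# Omit negative R^2 splits (where the train-data mean beats the model).
masked = np.where(test_scores < 0, np.nan, test_scores)
mean_r2 = np.nanmean(masked, axis=1)
for n, r2 in zip(sizes, mean_r2):
    print(f"n_train={n}: mean R^2 = {r2:.3f}")
```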

Table S4: List of plasmids used in this study

Table S5: List of primers used in this study

Table S6: Features of sgRNA and crRNA design

Table S1: Promoters, ORFs and terminator sequences
See attached Excel file.