Enhancing Semiempirical Quantum Mechanical Scoring with Machine Learning: a new scoring function that accounts for both the enthalpic and entropic contributions to the ligand binding free energy

Thomas Evangelidis; Ilektra-Chara Giassa; Mario Lovrić

doi:10.26434/chemrxiv-2022-68n6h

Thomas Evangelidis^*, Ilektra-Chara Giassa, Mario Lovrić

^*Corresponding author for this work

Know-Center GmbH Research Center for Data-Driven Business & Big Data Analytics

Research output: Working paper › Preprint

Abstract

Identifying hit compounds is a principal step in early-stage drug discovery. Although many machine learning (ML) approaches have been proposed, in the absence of binding data molecular docking remains the most widely used option for predicting binding modes and scoring hundreds of thousands of compounds for binding affinity to the target protein. Docking's effectiveness depends critically on the protein-ligand (P-L) scoring function (SF); re-scoring with more rigorous SFs is therefore common practice. In this pilot study, we scrutinize the PM6-D3H4X/COSMO semiempirical quantum mechanical (SQM) method as a docking pose re-scoring tool on 17 diverse receptors and ligand decoy sets, totaling 1.5 million P-L complexes. We investigate the effect of explicitly computed ligand conformational entropy and ligand deformation energy on SQM P-L scoring in a virtual screening (VS) setting, as well as molecular mechanics (MM) versus hybrid SQM/MM structure optimization prior to re-scoring. Our results indicate that there is no obvious benefit from computing ligand conformational entropies or deformation energies, and that optimizing only the ligand's geometry at the SQM level is sufficient to achieve the best possible scores. Instead, we leverage ML to implicitly incorporate the missing entropy terms into the SQM score using ligand topology, physicochemical, and P-L interaction descriptors. Our new hybrid scoring function, named SQM-ML, is transparent and explainable, and achieves on average 9% higher AUC-ROC than PM6-D3H4X/COSMO and 3% higher than Glide SP, with consistent and predictable performance across all test sets, unlike those two SFs, whose performance is considerably target-dependent and sometimes resembles that of a random classifier. The code to prepare and train SQM-ML models is available at https://github.com/tevang/sqm-ml.git, and we believe it will pave the way for a new generation of hybrid SQM/ML protein-ligand scoring functions.
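The abstract compares scoring functions by AUC-ROC, which can be read as the probability that a randomly chosen active compound is ranked above a randomly chosen decoy. The following is a minimal illustrative sketch of that metric only; it is not taken from the sqm-ml repository. The function name `auc_roc`, the toy scores, and the convention that a higher score means "more likely active" (for example, a negated binding energy) are all assumptions made for this example.

```python
from typing import Sequence


def auc_roc(scores: Sequence[float], is_active: Sequence[bool]) -> float:
    """Pairwise AUC-ROC: fraction of (active, decoy) pairs ranked correctly.

    Assumption: a HIGHER score means the compound is predicted more likely
    to be active. Ties between an active and a decoy count as half a win.
    """
    actives = [s for s, y in zip(scores, is_active) if y]
    decoys = [s for s, y in zip(scores, is_active) if not y]
    if not actives or not decoys:
        raise ValueError("need at least one active and one decoy")

    wins = 0.0
    for a in actives:
        for d in decoys:
            if a > d:
                wins += 1.0
            elif a == d:
                wins += 0.5
    return wins / (len(actives) * len(decoys))


# Toy data (hypothetical): two actives followed by three decoys.
labels = [True, True, False, False, False]
scores_informative = [0.9, 0.4, 0.5, 0.3, 0.1]    # ranks most actives first
scores_uninformative = [0.6, 0.4, 0.5, 0.3, 0.7]  # no better than chance

print(auc_roc(scores_informative, labels))    # 5 of 6 pairs ranked correctly
print(auc_roc(scores_uninformative, labels))  # 3 of 6 pairs ranked correctly
```

A value of 0.5 corresponds to the "random classifier" behaviour mentioned in the abstract, while 1.0 would mean every active outranks every decoy. The pairwise loop is quadratic in the number of compounds, so a rank-based formulation would be preferable for sets of the size reported here.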

Original language	English
DOIs	https://doi.org/10.26434/chemrxiv-2022-68n6h
Publication status	Published - 23 Dec 2022

Access to Document

10.26434/chemrxiv-2022-68n6h (Licence: CC BY 4.0)

Cite this

@techreport{f8401c789c8b466fb18bcdd8af59d5e9,

title = "Enhancing Semiempirical Quantum Mechanical Scoring with Machine Learning: a new scoring function that accounts for both the enthalpic and entropic contributions to the ligand binding free energy",

author = "Thomas Evangelidis and Ilektra-Chara Giassa and Mario Lovri{\'c}",

year = "2022",

month = dec,

day = "23",

doi = "10.26434/chemrxiv-2022-68n6h",

language = "English",

type = "WorkingPaper",

}

TY - UNPB

T1 - Enhancing Semiempirical Quantum Mechanical Scoring with Machine Learning: a new scoring function that accounts for both the enthalpic and entropic contributions to the ligand binding free energy

AU - Evangelidis, Thomas

AU - Giassa, Ilektra-Chara

AU - Lovrić, Mario

PY - 2022/12/23

Y1 - 2022/12/23

U2 - 10.26434/chemrxiv-2022-68n6h

DO - 10.26434/chemrxiv-2022-68n6h

M3 - Preprint

BT - Enhancing Semiempirical Quantum Mechanical Scoring with Machine Learning: a new scoring function that accounts for both the enthalpic and entropic contributions to the ligand binding free energy

ER -