Last modified: 2019-07-29
Abstract
Background: Advances in computational technology, together with large, freely accessible public databases such as ChEMBL, benefit QSAR modeling, which can be used for virtual screening and for the optimization of lead compounds. One drug target of particular interest is DPP-4 for the treatment of diabetes, because diabetes is among the ten leading causes of death in the world today. Objective: This study aims to build machine learning models with good predictive ability, so that the bioactivity of compounds with diverse chemical structures against the target protein can be predicted. Materials and Methods: A set of 4355 compounds with IC50 activity against the DPP-4 target was taken from the ChEMBL database and curated by removing duplicates, entries with missing values, and salts, and by normalizing structures, which resulted in approximately 3739 data points. Features were extracted using various combinations of descriptors and fingerprints from CDK and RDKit and were then used to build five types of model: Deep Learning, XGBoost Tree Ensemble, Generalized Linear Model, Gradient Boosting Machine, and Random Forest. The whole process, from preprocessing to the final stage of model evaluation, was automated in the KNIME data mining application, version 3.7. Results: The best model under 5-fold cross-validation was the Random Forest, with R² of 0.75 and MSE of 0.615. The external validation test showed that the QSAR model meets the parametric standards for QSAR modeling. Conclusion: The resulting model shows predictive ability with low error and meets the parametric criteria for QSAR modeling, so it can be applied to QSAR analysis of DPP-4. Furthermore, this QSAR modeling process produced a workflow that can be applied to other targets, not only DPP-4.
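As an illustration of how the two reported metrics can be obtained under k-fold cross-validation, the sketch below pools out-of-fold predictions and computes R² and MSE from them. It is a minimal sketch under stated assumptions, not the study's workflow, which was built in KNIME: the function names (`mse`, `r2`, `k_fold_indices`, `cross_validate`, `fit_line`) are hypothetical, the one-feature linear fit is only a stand-in for the Random Forest, the data are synthetic, and whether the study pooled predictions or averaged per-fold scores is not stated in the abstract.

```python
import random


def mse(y_true, y_pred):
    """Mean squared error between observed and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def k_fold_indices(n, k=5, seed=0):
    """Shuffle the indices 0..n-1 and split them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_line(xs, ys):
    """Stand-in model (assumption): least-squares line on a single feature."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: slope * x + intercept


def cross_validate(x, y, fit, k=5, seed=0):
    """Predict every sample from a model that never saw it, then score."""
    y_pred = [None] * len(y)
    for held_out in k_fold_indices(len(y), k, seed):
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        model = fit([x[i] for i in train], [y[i] for i in train])
        for i in held_out:
            y_pred[i] = model(x[i])
    return r2(y, y_pred), mse(y, y_pred)


# Synthetic, noise-free data (not from ChEMBL), used only to exercise the code.
x = [float(i) for i in range(20)]
y = [2.0 * xi + 1.0 for xi in x]
cv_r2, cv_mse = cross_validate(x, y, fit_line, k=5)
```

In a real QSAR setting, `fit` would wrap the chosen learner and `x` would hold descriptor or fingerprint vectors rather than a single number.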