williamhuang3/ml-based-drug-identifier

ML-Based Drug Identifier

Overview

This project aims to streamline the early phases of drug discovery using AI and bioinformatics, focusing on identifying and evaluating potential drug candidates. Using the ChEMBL database, it searches for compounds that interact with a specific biological target, with Histone Deacetylase 1 as the primary example. The project assesses these compounds against the Lipinski Rule of 5 and other molecular descriptors, and classifies them as active or inactive based on their IC50 values. IC50 is the concentration of a compound required to inhibit a biological process by half, so it measures potency: the lower the IC50, the more potent the compound.

Furthermore, the project computes PaDEL descriptors for a more in-depth analysis and uses Random Forest Regression to predict the IC50 values of compounds from them. This approach helps identify promising drug candidates and can reduce research and development costs by narrowing down which compounds need experimental testing.
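The active/inactive classification described above can be sketched in a few lines. This is a minimal illustration, not the project's own code: the cutoffs (1,000 nM and 10,000 nM), the extra "intermediate" label, and the function names are assumptions, since this README does not state the thresholds used. The pIC50 conversion is the standard negative base-10 logarithm of the molar IC50.

```python
import math

# Assumed cutoffs (not stated in this README): IC50 <= 1,000 nM counts as
# active, IC50 >= 10,000 nM as inactive, anything between as intermediate.
ACTIVE_MAX_NM = 1_000.0
INACTIVE_MIN_NM = 10_000.0


def ic50_nm_to_pic50(ic50_nm: float) -> float:
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in molar)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    return -math.log10(ic50_nm * 1e-9)


def bioactivity_class(ic50_nm: float) -> str:
    """Label a compound by its IC50 using the assumed cutoffs above."""
    if ic50_nm <= ACTIVE_MAX_NM:
        return "active"
    if ic50_nm >= INACTIVE_MIN_NM:
        return "inactive"
    return "intermediate"
```

Working on the pIC50 scale is a common choice because IC50 values span several orders of magnitude; a higher pIC50 means a more potent compound.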

Dependencies

To run this project, the following dependencies are required:

  • ChEMBL and RDKit:
    conda install -c rdkit rdkit -y
  • Bash (either through Conda or Git Desktop):
    conda install -c conda-forge bash
  • TextWrap3:
    pip install textwrap3
  • Other Essential Libraries (Matplotlib, Seaborn, Pandas, Numpy, Scikit-Learn, SciPy):
    pip install matplotlib seaborn pandas numpy scikit-learn scipy

Content

  • Introduction
    • Overview of the project, its objectives, and the methodology used.
  • Getting Started and Example Inputs
    • Setup steps and suggested targets for trial runs, including CHEMBL325, CHEMBL220, and CHEMBL3927, with CHEMBL325 as the primary example.
  • Plotting
    • Details on how data for the Lipinski Descriptors are plotted, including examples.
  • Regression
    • Explanation of how Random Forest Regression is utilized to predict IC50 values.
  • More Info and Credits
    • Additional resources and acknowledgments.

Getting Started

To get started, follow the installation steps to set up the environment and install the necessary dependencies. Next, select a target from the suggested list or choose one of interest to you. The process involves extracting data on compounds that interact with the target, analyzing their properties according to the Lipinski Rule of 5, and applying statistical and machine learning models to evaluate their potential as drug candidates.
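As a rough sketch of the Lipinski step, the check below counts Rule of 5 violations from four descriptor values that have already been computed. The `Descriptors` container and the function names are hypothetical, chosen for illustration. In practice the values could come from RDKit (for example `Descriptors.MolWt`, `Descriptors.MolLogP`, `Lipinski.NumHDonors`, and `Lipinski.NumHAcceptors`), but the sketch takes plain numbers so it runs without RDKit installed.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Four Lipinski descriptors for one compound (hypothetical container)."""
    mol_weight: float   # molecular weight in daltons
    log_p: float        # octanol-water partition coefficient
    h_donors: int       # number of hydrogen bond donors
    h_acceptors: int    # number of hydrogen bond acceptors


def lipinski_violations(d: Descriptors) -> int:
    """Count how many of the four Rule of 5 criteria are broken."""
    checks = [
        d.mol_weight > 500,
        d.log_p > 5,
        d.h_donors > 5,
        d.h_acceptors > 10,
    ]
    return sum(checks)


def passes_rule_of_five(d: Descriptors) -> bool:
    """Common reading of the rule: at most one violation is tolerated."""
    return lipinski_violations(d) <= 1
```

For instance, `passes_rule_of_five(Descriptors(350.0, 2.5, 2, 5))` is true, while a compound with a molecular weight of 650 and a LogP of 6.2 breaks two criteria and fails.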

Data to try: CHEMBL325 (Histone deacetylase 1), CHEMBL220 (Homo sapiens acetylcholinesterase), CHEMBL3927 (SARS coronavirus 3C-like proteinase)

In this example, I used CHEMBL325:

Example Screen

Results and Visualization

The project includes plotting the evaluated data using Matplotlib and Seaborn to visualize the distribution and comparison between active and inactive compounds across different molecular descriptors. These plots are crucial for understanding the characteristics that contribute to a compound's effectiveness and bioactivity.
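A comparison plot of this kind can be sketched as follows. This is an illustration only: it uses Matplotlib alone to keep the example small, the molecular weights are made-up numbers rather than project results, and `plot_descriptor` and the output filename are hypothetical names.

```python
import matplotlib
matplotlib.use("Agg")  # render to a file without needing a display
import matplotlib.pyplot as plt

# Made-up molecular weights, for illustration only (not project results).
mol_weight = {
    "active": [310.2, 355.4, 402.1, 378.9, 420.5],
    "inactive": [250.3, 298.7, 275.0, 330.6, 289.4],
}


def plot_descriptor(groups: dict, ylabel: str, out_path: str) -> str:
    """Draw one box per bioactivity class and save the figure to out_path."""
    fig, ax = plt.subplots(figsize=(5, 5))
    labels = list(groups)
    ax.boxplot([groups[name] for name in labels])
    # Boxes are drawn at positions 1..n, so label those tick positions.
    ax.set_xticks(range(1, len(labels) + 1))
    ax.set_xticklabels(labels)
    ax.set_xlabel("Bioactivity class")
    ax.set_ylabel(ylabel)
    fig.savefig(out_path, dpi=150, bbox_inches="tight")
    plt.close(fig)
    return out_path


plot_descriptor(mol_weight, "Molecular weight (Da)", "plot_MW_sketch.png")
```

The same function could be called once per descriptor (LogP, hydrogen bond donors, hydrogen bond acceptors) to produce one figure each.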

Molecular Weight: Active vs. Inactive

plot_MW.png

Molecular Weight vs. Log(P): Active vs. Inactive

plot_MW_vs_LogP.png

Number of Hydrogen Acceptors: Active vs. Inactive

plot_NumHAcceptors.png

Number of Hydrogen Donors: Active vs. Inactive

plot_NumHDonors.png

IC50 Values: Active vs. Inactive

plot_ic50.png

LogP: Active vs. Inactive

LogP

Frequencies: Active vs. Inactive

plot_bioactivity_class.png

Predictive Modeling

Using PaDEL descriptors and Random Forest Regression, the project aims to predict the IC50 values of compounds. The model learns to relate the descriptors (features) to the IC50 values (target) across the training dataset. A random forest builds many decision trees, each trained on a random sample of the data and restricted to random subsets of the features, and averages their predictions. This randomness makes the model more robust and less prone to overfitting the training data. The resulting predictions offer a cost-effective way to assess a compound's viability as a drug candidate before committing to extensive laboratory testing.
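The regression step might look roughly like the sketch below, which uses scikit-learn's `RandomForestRegressor`. Everything about the data is an assumption made for illustration: the feature matrix is random binary data standing in for PaDEL descriptors, the target is a synthetic pIC50-like value, and the hyperparameters are not taken from the project.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for a descriptor matrix: 200 compounds x 50 binary bits.
X = rng.integers(0, 2, size=(200, 50))
# Synthetic pIC50-like target that depends on the first five bits plus noise.
y = 5.0 + X[:, :5].sum(axis=1) * 0.4 + rng.normal(0.0, 0.2, size=200)

# Hold out 20% of the compounds to evaluate on data the model has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Each tree is fit on a bootstrap sample of the training rows; max_features
# limits how many features are considered at each split.
model = RandomForestRegressor(
    n_estimators=100, max_features="sqrt", random_state=0
)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
r2 = model.score(X_test, y_test)  # coefficient of determination (R^2)
print(f"held-out R^2: {r2:.2f}")
```

Plotting `y_pred` against `y_test` gives a predicted-versus-experimental scatter plot of the kind shown in the figure that follows.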

predicted_experimental_pIC50.png

Conclusion

This project is a tool for drug discovery that uses bioinformatics and artificial intelligence to streamline the search for and evaluation of new drug candidates. By reducing the need for extensive experimental testing, it shows promise for accelerating the development of effective treatments and cutting its cost.

More Info And Credits

IC50 Definition

Lipinski Rule of 5

ChEMBL Database

Random Forest Regression

This project was inspired by resources and tutorials from Data Professor on YouTube and machinelearningmastery.com.

Written by William Huang
