Faculty.utrgv.edu
End of Semester Project for ELEE 4333: Topics in ELEE – Machine LearningExploration of Machine Learning Framework RDkit for Chemistry ResearchBy Ivan MoradoIntroductionMachine Learning’s wealth of algorithmic and mathematical structures allow for extraction of meaningful scientific data from experiments. Much of the algorithms can be applied to a wide range of fields in the sciences. For this report we will be focusing on two frameworks as applied to prediction tasks involving known chemical properties, and unknown but desirable chemical properties. The technical terms are QSPR (quantitative structure-property relationship) and QSAR (quantitative structure activity relationship). The most important thing to do when approaching a problem from the lens of machine learning is organizing the data in such a way that a computer can access it and meaningfully mutate it via computational analysis. We also want to make sure that the way we organize our data retains the most information of the problem at hand. In chemistry we can organize data in two ways, SMILES (Simplified Molecular Entry Line System) or a Molfile, which is a deeper representation of a molecule, though requires larger memory and processing speed for analyzing. We will be focusing on a very useful tool, RDkit, which is machine learning algorithms written in C++ and python for computational chemistry. Our goal is a proper prediction task for a set of molecules such that we can search for a chemical property known as lipophilicity, a chemical compound dissolving behavior in lipids, oils, fats. Being able to predict such a property from chemical data is useful for finding novel drugs. Another use of training a model for this prediction property is that it reduces time spent in the laboratory obtaining similar results, which can be costly in the long run. By using machine learning techniques, we can find lipophilicity without wasting time on chemical compounds that may not have the behavior of interest. RDkit FrameworkRDKit is a chemoinformatics software written in C++ and python. It holds many useful functions for extracting useful information from both SMILE and Molfile data formats of chemical information. Figures 1 and 2 show the two types of ways to store chemical information in a computer. Figure 1. SMILES format for data representation of moleculesFigure 2. Molfile format for data representation of moleculesBoth are very useful, with their pros and cons. The simplest representation for RDkit is SMILES, because they can be written as string data and be manipulated by the user. They don’t tell us much about higher dimensional behavior of our molecules and so we lose some exactness in interpreting results. Nonetheless they are still useful for basic chemical explorations. Molfiles are useful because they contain far more information, which also increase computational costs, thought gives us more meaningful results. Knowing the pros and cons of each we proceed with our property prediction task. It's usually evaluated by a distribution coefficent p - the ratio of a compounds concentration in a mixture of two immiscible phases at equilibrium (in our case water and octanol). The greater the p value, the greater the lipophilicity. Lipophilicity is usually presented as a log of base ten of p or: lipophilicity = log10pA useful function within RDkit is Chem.MolFromSmiles, which is used to convert a SMILES data set to a Molfile format. We use the following code in Figure 3. #Let's load the data and look at themdf= pd.read_csv('../input/mlchem/logP_dataset.csv', names=['smiles', 'logP'])df.head()#Importing Chem modulefrom rdkit import Chem #Method transforms smiles strings to mol rdkit objectdf['mol'] = df['smiles'].apply(lambda x: Chem.MolFromSmiles(x)) #Now let's see what we've gotprint(type(df['mol'][0]))Figure 3. Converting SMILES dataset to Molfile for pattern extractionWe can then visualize the molfile using the following function: from rdkit.Chem import Drawmols = df['mol'][:20]#MolsToGridImage allows to paint a number of molecules at a timeDraw.MolsToGridImage(mols, molsPerRow=5, useSVG=True, legends=list(df['smiles'][:20].values))Figure 4. Drawing molecules in RDKitFigure 5. Visualized Chemical CompoundsThe size of a molecule can be extracted from the number of atoms it possesses. We can extract corresponding values of the number of atoms for each chemical compound. By doing this we can attempt to see if there is any relationship between the lipophilicity and the number of atoms for our molecular dataset. If so, it can provide us information about the particular molecules lipophilicity. # AddHs function adds H atoms to a MOL (as Hs in SMILES are usualy ignored)# GetNumAtoms() method returns a general number of all atoms in a molecule# GetNumHeavyAtoms() method returns a number of all atoms in a molecule with molecular weight > 1df['mol'] = df['mol'].apply(lambda x: Chem.AddHs(x))df['num_of_atoms'] = df['mol'].apply(lambda x: x.GetNumAtoms())df['num_of_heavy_atoms'] = df['mol'].apply(lambda x: x.GetNumHeavyAtoms())Figure 6. Using RDKit functions for obtaining number of atoms per chemicalWe can then import seaborn and use a joint plot to plot the two variables. Figure 7. Joint Plot of Lipophilicity and Number of Hydrogen AtomsFrom a quick glance we see that there really isn’t any meaningful relationship using the number of hydrogen atoms alone. What we can do then is to see what the count of the frequently occurring atoms are in our dataset and seeing if there’s a pattern there. We can, for instance, look for the occurrence of a carbon atom. To do this we use the following function: # First we need to settle the pattern.c_patt = Chem.MolFromSmiles('C')# Now let's implement GetSubstructMatches() methodprint(df['mol'][0].GetSubstructMatches(c_patt))((0,), (1,), (2,), (3,))Figure 8. Obtaining Count of Commonly Occurring AtomsWhich gives us: Figure 9. Tuple of Positions Corresponding to Sub-patterns#We're going to settle the function that searches patterns and use it for a list of most common atoms onlydef number_of_atoms(atom_list, df): for i in atom_list: df['num_of_{}_atoms'.format(i)] = df['mol'].apply(lambda x: len(x.GetSubstructMatches(Chem.MolFromSmiles(i))))number_of_atoms(['C','O', 'N', 'Cl'], df)sns.pairplot(df[['num_of_atoms','num_of_C_atoms','num_of_N_atoms', 'num_of_O_atoms', 'logP']], diag_kind='kde', kind='reg', markers='+')plt.show()Figure 10. Refining Search for Common AtomsWhich gives us the following plots: Figure 11. Occurrence of List of Atoms across Molecular DatasetFirstly, we find that there is some linear dependence of lipophilicity on the number of some commonly occurring atoms. Secondly, we also find that the logP has a decent distribution for the lower row. Why is this useful? Because it clearly indicates that atom distribution in molecules has some relationship to lipophilicity. This means we can proceed to build our model. We’ll apply ridge regression which is an L2 regularization that helps with having a lot of predictor variables, or number of atoms. from sklearn.linear_model import RidgeCVfrom sklearn.model_selection import train_test_split#Leave only features columnstrain_df = df.drop(columns=['smiles', 'mol', 'logP'])y = df['logP'].valuesprint(train_df.columns)#Perform a train-test split. We'll use 10% of the data to evaluate the model while training on 90%X_train, X_test, y_train, y_test = train_test_split(train_df, y, test_size=.1, random_state=1)def evaluation(model, X_test, y_test): prediction = model.predict(X_test) mae = mean_absolute_error(y_test, prediction) mse = mean_squared_error(y_test, prediction) plt.figure(figsize=(15, 10)) plt.plot(prediction[:300], "red", label="prediction", linewidth=1.0) plt.plot(y_test[:300], 'green', label="actual", linewidth=1.0) plt.legend() plt.ylabel('logP') plt.title("MAE {}, MSE {}".format(round(mae, 4), round(mse, 4))) plt.show() print('MAE score:', round(mae, 4)) print('MSE score:', round(mse,4))#Train the modelridge = RidgeCV(cv=5)ridge.fit(X_train, y_train)#Evaluate resultsevaluation(ridge, X_test, y_test)Figure 12. Model for Predicting Lipophilicity TaskFigure 13. Results of our Trained ModelThe values for our MAE and MSE are decent, but could we do better? Yes! One way is to create new features of the molecules using RDkits descriptor function. We can add surface sums of polar atoms along with their attached hydrogen atom, exact molecular weight, number of valence electrons, and the general number of non-carbon atoms. from rdkit.Chem import Descriptorsdf['tpsa'] = df['mol'].apply(lambda x: Descriptors.TPSA(x))df['mol_w'] = df['mol'].apply(lambda x: Descriptors.ExactMolWt(x))df['num_valence_electrons'] = df['mol'].apply(lambda x: Descriptors.NumValenceElectrons(x))df['num_heteroatoms'] = df['mol'].apply(lambda x: Descriptors.NumHeteroatoms(x))Figure 14. Updated Descriptors for Lipophilicity Prediction Task ModelWe can then test the performance of our model with these new descriptors. train_df = df.drop(columns=['smiles', 'mol', 'logP'])y = df['logP'].valuesprint(train_df.columns)#Perform a train-test split. We'll use 10% of the data to evaluate the model while training on 90%X_train, X_test, y_train, y_test = train_test_split(train_df, y, test_size=.1, random_state=1)#Train the modelridge = RidgeCV(cv=5)ridge.fit(X_train, y_train)#Evaluate results and plot predictionsevaluation(ridge, X_test, y_test)Figure 15. Training the Model with new descriptorsWhich gives us a better error percentage. Figure 16. Trained Model with New Chemical DescriptorsConclusionRDkit is an incredibly useful machine learning framework specifically designed to allow complex modeling of chemical properties. The usefulness of such a software can be seen in various fields, including solar cell research, novel drug discovery, semiconductor materials research. Although this example is a simple one, complex regimes in chemistry may be computationally analyzed, leading to cost-effective methods in both experimental and theoretical work. References Kaggle Datasets: LogP of Chemical Structures.?,(correlations%20between%20predictor%20variables). ................
................
In order to avoid copyright disputes, this page is only a partial summary.
To fulfill the demand for quickly locating and searching documents.
It is intelligent file search solution for home and business.