3D-QSAR/QSPR Based Surface-Dependent Modeling Approach Derived From Semi-Empirical Quantum Mechanical Calculations

Dissertation submitted to the Faculty of Sciences (Naturwissenschaftliche Fakultät) of the Friedrich-Alexander-Universität Erlangen-Nürnberg for the degree of Dr. rer. nat.

Submitted by Marcel Youmbi Foka, from Cameroon



Accepted as a dissertation by the Faculty of Sciences / the Department of Chemistry and Pharmacy of the Friedrich-Alexander-Universität Erlangen-Nürnberg

Date of the oral examination: 05.12.2018

Chair of the doctoral committee: Prof. Dr. Georg Kreimer

Reviewers: Prof. Dr. Tim Clark, Prof. Dr. Birgit Strodel


Dedication

In memory of my late mother, Lucienne Metiegam,

whom the Lord took unto Himself on May 3, 2009.

My mother, light of my life, God rest her soul, had a special respect for my studies. She always encouraged me to move forward. I sincerely regret that she cannot witness the culmination of this work today. Mom, may the earth of our ancestors rest lightly upon you!

This is a special reward for Mr. Joseph Tchokoanssi Ngouanbe,

who always supported me financially and morally. May he find here the expression of my deep gratitude.


Acknowledgements

I would like to pay tribute to all those who contributed, scientifically or otherwise, to this work. My thanks go especially to Prof. Dr. Tim Clark, who gave me the opportunity and the means to work in his research team. I am grateful to him not only for supervising my work but also for his patience and for giving me the opportunity to explore this fascinating topic. As it was not in my area of expertise, I really enjoyed acquiring skills in this research area.

I sincerely thank Dr. Nico Van Eikema Hommes for his technical assistance. I thank Dr. Christian Cramer for introducing me to the Val-Mlr and MOE programs, for the help he has always given me, and for becoming a close friend. I thank Dr. Jr-Hung Lin for helping me to learn the VAMP, ParaSurf and Materials Studio programs.

Because I come from a French-speaking country, my level of English is not high. The help of Dr. Victoria Jackiw from the Language Centre of the University of Erlangen-Nuremberg in correcting the English of my thesis was therefore a very great contribution, and I am very grateful to her.

I sincerely thank Mr. Justin Choapoueng Nkue and Mrs. Elise Tchokoanssi Ngouanbe for the brotherly love they have always given me.
While praying for the repose of the souls of Mom Pauline Sikompe, Julienne Tagny, Jean Lonkep, Mom Lydie Teukam, and Alice Nana, allow me to extend my thanks to Mom Anne Kouatchie, Mom Elisabeth Kamgho, Bernadette Mafodjo, Martin Kugoua, Gildas Nkue, Willy Nkue, Anick, Carine, Larissa, Boris, and Cynthia Tchokoanssi, Gisele and Beatrice Youmbi, Paulette, Muriel, Lynn, Nancy, Cindy, and Erwin Choapoueng, Patricia Ngangoua, Family Ngongang, Family Wember, Family Bagnessi, Armelle Nanguep, Fabien Touko, Emmanuel Tchokoanssi, Therese Yommo, Susanne Kamdem, Odette and Paul Bati, Hugues Lengue, Armand Lonkep, Dieudonne Fogaing, Hubert Djapou, Luther Tagny, and Astride Nguetchuessi for the solidarity and for the family love they never withheld from me. Father Rigoberg Beck was kind enough to help me spiritually and emotionally during this time dedicated to my Ph.D, especially during the illness of my mother and after her death. I will not forget to thank Lißet Prechtel (R.I.P), Family Labbat-Metiogno, Dr. Eva and Radim Beranek, Family Guiffo, Brigitte Wohleben, Ferdinand Kuete, Angelika and Siegfried Balleis, Birgitt Aßmus, Alexandra Wunderlich, all the members of Christlich-Soziale Union (CSU), Odon Fokou, Dr. Pierrette Fofana, Guy Toko, Sylviane Tassing, Maila Dengel, Family Wete, Family Kuate, Family Yaneu, Family Sadjue, Family Tsumbu, Merlin Nkodja, Yanick Modjo, Gervais Gamgmeni, Emerent Prowo, Nathalie Tchamdjou, Sorelle Nsogning, everyone from Hering & Schneider GmbH, Carole Nya, Rosine Niaba, Chanceline Kamdem, Noelia Santos, Rose Kouatchet, Claude Heuyam, and Helene Kankeu for their sincere friendship, which has always united us, their sympathy and their solidarity.


It comes from the heart to thank Simone Rennoch, Käthe and Josef Rennoch, and all the German community of Pommer for the love, affection and support they always gave me. It is a real pleasure for me to thank all the people of Computer-Chemie-Centrum in Erlangen (CCC) with whom I enjoyed working during these years of my thesis, particularly Prof. Dr. Paul von Rague Schleyer (R.I.P), Prof. Dr. Bernd Meyer, Prof. Dr. Dirk Zahn, Dr. Harald Lanig, Dr. Pawel Rodziewicz, Dr. Mateuz Wielopolski, Dr. Hakan Kayi, Dr. Ute Seidel, Dr. Jakub Goclon, Dr. Adria Gil Mestres, Dr. Ahmed Elkerdawy, Maximilian Kriebel, Dr. Pavlo Dral, Jürgen Wittman, Dr. Alexander Urban, Dr. Sebastian Schenker, Matthias Wildauer, Dr. Patrick Duchstein, Dr. Theodor Milek, Dr. Frank Beierlein, Tilo Sauertig, Oscar Roja, Heike Thomas, Dr. Christian Wick, Philipp Altmann, Bahanur Becit, Stefano Sansotta, Johannes Träg, Isabelle Schraufstetter, Nadine Scharrer, and not forgetting all the others. Finally, I would like to thank the German Academic Exchange Service (DAAD), which through its program "DAAD-STIBET Doktorandenabschlussförderung" granted me a scholarship at the end of my studies.


Zusammenfassung (Summary)

This thesis describes several new QSAR/QSPR models for predicting physicochemical properties and biological activities of organic compounds.

New models for calculating the solvation free energy in water, octanol and chloroform were developed, based on gas-phase geometries optimized with AM1, AM1*, MNDO/d, and PM3 using VAMP. The new models combine the pure Coulomb solvation energy, derived from an SCRF calculation, with a surface integral of a function of local quantum mechanical properties on the surface. Although AM1* and MNDO/d do not provide a local polarizability, solvent effects could be calculated for these Hamiltonians by extending the SCRF routine from s- and p-orbitals to d-orbitals. The local properties were calculated with ParaSurf, based either on the isodensity surface or on the spherical-harmonic surface. The statistically best models were obtained with the isodensity surface. Among the Hamiltonians, AM1 gave the best predictions, with (R² = 0.92, MUE = 0.67, RMSD = 0.87), (R² = 0.92, MUE = 0.57, RMSD = 0.73), and (R² = 0.91, MUE = 0.46, RMSD = 0.61) for the solvation energy in water, octanol, and chloroform, respectively. For these solvation models, the contribution of the individual local properties was found to be more than 30% for the molecular electrostatic potential (MEP, V), between 15% and 25% for the local ionization energy (IEL), between 15% and 20% for the local electron affinity (EAL), and between 10% and 18% for the local polarizability (αL, POL) and the hardness (ηL, HARD). This small number of variables considerably reduced the risk of producing overtrained models. The absence of the local polarizability for AM1* and MNDO/d had a significant effect, above all for the solvation free energy in water for neutral and ionic compounds, with an RMSD difference of 6% compared with AM1 and PM3.

The solvation models in water and octanol that were developed with neutral compounds were applied to predict the octanol/water partition coefficient, logPow, for small molecules. For these compounds the models showed very good predictive power, but they appear to be of very limited use for calculating logPow for large molecules. The chloroform/water partition coefficient, logPcw, was also calculated for a set of small compounds in order to validate the models, and very good statistical results were obtained.

A new mathematical approach was developed, based on classified surface segments (bins) of the molecular electrostatic potential (MEP), the local ionization energy (IEL), the local electron affinity (EAL), the local polarizability (αL), the hardness (ηL), the electronegativity (χL, ENEG), the field normal to the surface (FN), and their cross-products. This approach differs fundamentally from the earlier polynomial surface-integral model (SIM), whose principle is to integrate MEP, IEL, EAL, αL and ηL over a molecular surface. The new surface-integral approach was then used to build logPow models for a very large data set, the LOGKOW database, consisting mainly of neutral small and large molecules. Starting from gas-phase geometries optimized with AM1, AM1*, PM3, MNDO, MNDO/d and PM6 using VAMP, models were generated using either the isodensity surface or the solvent-excluded surface to calculate the descriptors. These models were found to be strongly influenced by the flexibility and rigidity of the compounds, and compounds with fewer rotatable bonds were predicted better. Models based on the solvent-excluded surface gave the smaller deviations. As for the solvation predictions, AM1 gave the best results for the test set (R² = 0.89, MUE = 0.43, RMSE = 0.58) and, in about 25 of the 50 equations of the bagging approach, used a smaller number of descriptors, namely 40 of 336 (11.90%). AM1* and MNDO/d, which lack αL, are each based on 252 descriptors and used 39 (15.48%) and 55 (21.83%) of them, respectively, in the individual equations of the bagging approach. Because MEP × FN occurred in all 50 equations for AM1* and MNDO/d, FN was identified as the parameter that compensates for the lack of αL in these Hamiltonians. A close relationship was found between FN and the number of hydrogen-bond donors/acceptors, which was confirmed by the strong dependence of the logPow prediction on these parameters.

The logPow models developed so far were applied to the prediction of phospholipidosis. The data were provided by Pfizer Global R&D, Amboise (France) and Sandwich (UK). The logPow values obtained with the models based on AM1, AM1*, MNDO, MNDO/d, PM3 and PM6 were combined with the standard ParaSurf descriptors to give sets of 125 descriptors. These descriptor sets were evaluated with two machine-learning algorithms (Naive Bayes and Random Forest) in order to classify compounds according to their ability to induce phospholipidosis. The best test-set predictions came from models whose descriptors were calculated with the solvent-excluded surface. The best model, with an accuracy of 84%, was obtained with PM3 using the Random Forest classifier. The Naive Bayes algorithm gave surface-dependent models that, however, showed a similarity problem in the confusion matrix. This problem was completely resolved by applying the Random Forest classifier to descriptor sets calculated with the solvent-excluded surface. The isodensity surface gave a Matthews correlation coefficient (MCC) of 0.24 to 0.48, with an average of 0.38, for the Naive Bayes models and of 0.33 to 0.55, with an average of 0.47, for the Random Forest models. For the solvent-excluded surface, the MCC ranged from 0.33 to 0.57, with an average of 0.47, for Naive Bayes and from 0.50 to 0.68, with an average of 0.57, for Random Forest, which achieved the best prediction quality. Twenty-two of the 69 compounds of the test set were predicted very well by both the Naive Bayes and the Random Forest classification methods.


Abstract

In this thesis, some new QSAR/QSPR models for predicting physico-chemical and biological activities of organic compounds are described.

New solvation models for calculating the solvation free energy in water, octanol and chloroform have been developed, starting from gas-phase geometries optimized with AM1, AM1*, MNDO/d, and PM3 in VAMP. These models combine a pure Coulomb free energy of solvation, derived from an SCRF calculation, with a local term calculated as a surface integral of a function of local properties. Although AM1* and MNDO/d do not have local polarizability, the solvent effect could be calculated for these Hamiltonians by extending the SCRF routine, previously limited to s- and p-orbitals, to d-orbitals. The local properties were calculated with ParaSurf, using either the isodensity or the spherical-harmonic surface. The models with the best statistical performance were obtained with the isodensity surface (iso). Among the Hamiltonians, AM1 gave the best predictions, with statistical performances of (R² = 0.92, MUE = 0.67, RMSD = 0.87), (R² = 0.92, MUE = 0.57, RMSD = 0.73), and (R² = 0.91, MUE = 0.46, RMSD = 0.61) for the solvation free energy in water, octanol, and chloroform, respectively. For these solvation models, the contribution of each local property was found to be more than 30% for the molecular electrostatic potential (MEP, V), between 15% and 25% for the local ionization energy (IEL), between 15% and 20% for the local electron affinity (EAL), and between 10% and 18% for the local polarizability (αL or POL) and the hardness (ηL or HARD). The small number of variables considerably reduced the risk of generating overfitted models. The lack of αL for AM1* and MNDO/d was significant, especially for the solvation free energy in water for neutral and ionic compounds, with a difference in RMSD of ≈ 6% compared to AM1 and PM3.

The solvation models in water and octanol developed with neutral compounds were applied to calculate the octanol/water partition coefficient, logPow, for small molecules. For these compounds the models showed very good predictive power, but they appear to be very limited when used to calculate logPow for large molecules. The chloroform/water partition coefficient, logPcw, for a set of small compounds was also calculated in order to validate the models, and very good statistical performance was obtained.
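As an illustration of how a partition coefficient can follow from two solvation free energies, and of how statistics such as R², MUE and RMSD are typically evaluated, a minimal Python sketch is given here. It assumes the standard thermodynamic relation logP = (ΔG_water − ΔG_organic) / (2.303 RT) with energies in kcal/mol at 298.15 K, and it takes R² as the squared Pearson correlation; the function names and the convention for R² are assumptions of this sketch, not details taken from the thesis.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def log_partition(dg_water, dg_organic, temperature=298.15):
    """log10 of the organic/water partition coefficient.

    dg_water and dg_organic are solvation free energies in kcal/mol.
    A more negative value in the organic phase gives a positive logP.
    """
    return (dg_water - dg_organic) / (math.log(10.0) * R_KCAL * temperature)


def regression_stats(observed, predicted):
    """Return (R^2, MUE, RMSD) for paired observed/predicted values.

    R^2 is taken here as the squared Pearson correlation (an assumption);
    it is undefined if either series is constant.
    """
    n = len(observed)
    errors = [p - o for o, p in zip(observed, predicted)]
    mue = sum(abs(e) for e in errors) / n                # mean unsigned error
    rmsd = math.sqrt(sum(e * e for e in errors) / n)     # root-mean-square deviation
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    r2 = cov * cov / (var_o * var_p)
    return r2, mue, rmsd


if __name__ == "__main__":
    # Hypothetical solvation free energies in kcal/mol, for illustration only.
    dg_water = [-4.0, -6.3, -0.9]
    dg_octanol = [-5.5, -5.8, -4.2]
    predicted = [log_partition(w, o) for w, o in zip(dg_water, dg_octanol)]
    observed = [1.0, -0.3, 2.5]  # hypothetical experimental logPow values
    print(predicted, regression_stats(observed, predicted))
```

Because logP depends on the difference of two computed free energies, errors in the two solvation models can either cancel or add, which is one reason to report the statistics for logP separately from those of the individual solvents.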

A new approach is presented, based among others on the MEP, IEL, EAL, αL, ηL, the electronegativity (χL or ENEG), the field normal to the surface (FN), and their cross-products, evaluated over the surface divided into bins. It is fundamentally different from the former polynomial surface-integral model (SIM), whose principle was to integrate MEP, IEL, EAL, αL, and ηL across a molecular surface. This approach, called binned SIM, was then used with a very large logPow data set obtained from the LOGKOW database to generate models for accurately predicting logPow for a data set of large and small compounds that are mainly present in their neutral forms. Starting from gas-phase geometries optimized with AM1, AM1*, PM3, MNDO, MNDO/d, and PM6 in VAMP, the models were generated using either the iso or the solvent-excluded surface (SES) for calculating the descriptors. These models were found to be strongly influenced by the flexibility and rigidity of the compounds, and compounds with a small number of rotatable bonds were predicted best. Models generated with descriptors calculated on the SES showed better statistical performance. As for the solvation models, AM1 gave the best statistical performance for the test set (R² = 0.89, MUE = 0.43, RMSE = 0.58) and, in about 25 of the 50 bagging equations, used a lower number of descriptors, 40 of 336 (11.90%). AM1* and MNDO/d, which lack αL, had 252 descriptors each and used 39 (15.48%) and 55 (21.83%) of them, respectively. Because MEP × FN occurred in all 50 bagging equations for AM1* and MNDO/d, FN was identified as the parameter that compensates for the lack of αL in these Hamiltonians. A close relationship was found between FN and the number of hydrogen-bond donors/acceptors, confirming the strong dependence of the logPow prediction on these parameters.
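The idea of descriptors defined over "the surface divided into bins" can be sketched as follows. The sketch assumes that the molecular surface is available as a list of points, each carrying an area element and a local property value, and that a descriptor is the total surface area whose property value falls into a given bin. The function name, the equal-width bins, and the clamping of out-of-range values are hypothetical choices made for illustration; the abstract does not specify the actual binning scheme.

```python
def binned_area_descriptors(values, areas, lower, upper, n_bins):
    """Sum the surface area falling into each equal-width property bin.

    values: local property value at each surface point (e.g. MEP)
    areas:  area element associated with each surface point
    Values outside [lower, upper) are assigned to the nearest edge bin
    (an assumption of this sketch).
    """
    width = (upper - lower) / n_bins
    descriptors = [0.0] * n_bins
    for value, area in zip(values, areas):
        index = int((value - lower) / width)
        index = min(max(index, 0), n_bins - 1)  # clamp into the valid range
        descriptors[index] += area
    return descriptors


def cross_product(values_a, values_b):
    """Pointwise product of two local properties, e.g. MEP x FN."""
    return [a * b for a, b in zip(values_a, values_b)]


if __name__ == "__main__":
    # Hypothetical surface data: four points with unit-free toy values.
    mep = [-1.0, 0.5, 2.5, 9.0]
    fn = [0.2, -0.4, 1.0, 0.1]
    areas = [1.0, 2.0, 3.0, 4.0]
    print(binned_area_descriptors(mep, areas, lower=0.0, upper=4.0, n_bins=4))
    print(binned_area_descriptors(cross_product(mep, fn), areas, -1.0, 3.0, 4))
```

With this construction the descriptors of one property always sum to the total surface area, so each descriptor set describes how the surface is distributed over property values rather than a single integrated quantity.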

The logPow models previously developed were applied to gas-phase geometries of sets of phospholipidosis-inducing compounds obtained from Pfizer Global R&D at Amboise Laboratories, France, and Sandwich Laboratories, UK. The logPow values obtained for AM1, AM1*, MNDO, MNDO/d, PM3, and PM6 were added to the standard ParaSurf descriptors, calculated with either the iso or the SES, to generate sets of 125 descriptors. These descriptor sets were used with two machine-learning (ML) algorithms, Naive Bayes (NB) and Random Forest (RF), to generate models that classify compounds according to their ability to induce phospholipidosis. When evaluated on the respective test sets, the models built from descriptors calculated with the SES showed better predictive performance. The best model, with a predictive power of 84%, was obtained with PM3 and the RF classifier. The NB algorithm provided surface-dependent models but suffered from a problem of similarity in the confusion matrix. This problem was fully corrected by applying the RF classifier to descriptor sets obtained with the SES. With the iso, the Matthews correlation coefficient (MCC) ranged from 0.24 to 0.48, with an average of 0.38, for NB and from 0.33 to 0.55, with an average of 0.47, for RF. With the SES, the MCC ranged from 0.33 to 0.57, with an average of 0.47, for NB and from 0.50 to 0.68, with an average of 0.57, for RF, which yielded the best prediction quality. Twenty-two of the 69 compounds of the test set were predicted very well by both classifiers.
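For reference, the Matthews correlation coefficient and the accuracy of a binary classifier can be computed from the four entries of its confusion matrix, as in the short sketch below. The counts in the usage example are hypothetical and are not the thesis results. Note that `fn` in the code denotes false negatives and is unrelated to the field normal to the surface (FN) used as a descriptor above.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC from confusion-matrix counts.

    tp/tn: true positives/negatives, fp/fn: false positives/negatives.
    Returns 0.0 when a row or column of the matrix is empty, a common
    convention for the otherwise undefined case.
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


def accuracy(tp, tn, fp, fn):
    """Fraction of correctly classified compounds."""
    return (tp + tn) / (tp + tn + fp + fn)


if __name__ == "__main__":
    # Hypothetical confusion matrix, for illustration only.
    counts = dict(tp=30, tn=25, fp=8, fn=6)
    print(matthews_corrcoef(**counts), accuracy(**counts))
```

Unlike accuracy, the MCC stays close to zero for a classifier that assigns almost every compound to the majority class, which makes it the more informative of the two when the inducing and non-inducing classes are unbalanced.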


Contents

Chapter 1: Introduction
1.1 Computer and Life Skills
1.2 Computational and Theoretical Chemistry Challenges
1.3 Computational Chemistry
1.3.1 Selection and Calculation of Descriptors
1.4 QSAR/QSPR Modeling
1.5 Objective and Thesis Outline
1.6 References

Chapter 2: Predicting the Solvation Free Energy Using a Combination of Semi-empirical Self-consistent Reaction Field Calculations and the Local Energy Properties
2.1 Introduction
2.1.1 Surface-Integral Models (SIMs)
2.1.1.1 Isodensity Surface
2.1.1.2 Spherical-Harmonic Surface
2.2 Methods
2.3 Results
2.3.1 Free Energy of Solvation
2.3.1.1 Local Properties
2.3.1.2 Free Energy of Solvation in Water
2.3.1.3 Free Energy of Solvation in Octanol
2.3.1.4 Free Energy of Solvation in Chloroform
2.3.2 The Partition Coefficient: logP
2.3.2.1 The Octanol-Water Partition Coefficient: logPow
2.3.2.1.1 LogPow for Small Molecules
2.3.2.1.2 LogPow for Large Molecules
2.3.2.2 The Chloroform-Water Partition Coefficient: logPcw
2.4 Discussion
2.5 Conclusions
2.6 References

Chapter 3: Binned Surface-Integral Models for Predicting the Octanol-Water Partition Coefficient
3.1 Introduction
3.2 Methods of Calculating
3.2.1 Solvent-Excluded Surface (SES)
3.3 Results
3.3.1 Conformational Dependence
3.3.2 Comparison with Publicly Available logPow Models
3.3.3 Variable Importance
3.3.4 Variable Dependence
3.4 Discussion
3.5 Conclusions
3.6 References

Chapter 4: Comparative Study of Two Classification Algorithms for the Prediction of Drug-Induced Phospholipidosis
4.1 Introduction
4.2 Methods
4.2.1 Machine Learning Algorithms
4.2.1.1 Naive Bayes
4.2.1.2 Random Forest
4.3 Results
4.3.1 Machine Learning Models
4.3.1.1 Naive Bayes Models
4.3.1.2 Random Forest Models
4.4 Discussion
4.5 Conclusions
4.6 References

Appendix


Chapter 1

Introduction


1.1 Computer and Life Skills

In the past, in some African countries such as Cameroon, many tasks were difficult to accomplish because people had access only to rudimentary tools and archaic techniques. Our grandparents and parents used stones and mortars to crush corn, rice, millet and other grains. Today, thanks to advances in science and technology, the same people have seen these daily tasks made easier by tools and machinery derived from new technologies. The trader who once had to add with a sheet of paper and a pen can now do so far more easily and quickly with a calculator. The farmer who could only use a hoe or a pick benefits from machinery and equipment that are accessible and easy to operate. The office secretary who used a typewriter can type more quickly with a computer. A company's payment statements and staff data, formerly written in large and sometimes cumbersome registers, can easily be generated with this valuable tool.

The computer is an electronic machine that automatically reads a set of instructions in sequence and executes arithmetic and logic operations on bits. Because of this, it has become the most widely used tool in all areas of life, particularly in research, health, trade, transport, and communication. In research, and more specifically in chemistry, the computer can be used to draw molecules, plot graphs, compute statistics, and perform theoretical calculations and simulations.

1.2 Computational and Theoretical Chemistry Challenges

Quantum chemistry is a science whose main purpose is to clarify the electronic structure of molecules. To achieve this goal, it uses the models and principles of quantum mechanics, a discipline that has more in common with general physics and from which quantum chemistry was derived. These quantum-chemical models contribute effectively to the characterization and description of molecules and their interactions. Computational chemistry, which is related to quantum chemistry, originates in the efficient computer implementation of established quantum-chemical models and their application to specific chemical phenomena. In chemical research it is worth distinguishing between the computational and the quantum-chemical directions; the differences appear to lie mainly in the methodology and in how the results are interpreted.

Historically, the first theoretical calculations based on quantum mechanics appear to be those performed by Walter Heitler and Fritz London in 1927. The beginnings of computational and quantum chemistry were strongly influenced by the books of Linus Pauling and E. Bright Wilson¹; Eyring, Walter and Kimball²; Heitler³; and later Coulson⁴, which served chemists as primary references in the following years. In 1956, the first ab initio Hartree-Fock calculations were performed at the Massachusetts Institute of Technology (MIT), using a basis set of Slater orbitals and applicable only to diatomic molecules. A few years later, the Hückel method, a simple approach based on the linear combination of atomic orbitals (LCAO) for determining the energies of the molecular orbitals of π electrons in conjugated hydrocarbon systems, was applied computationally⁵ to molecules ranging from butadiene and benzene to ovalene. Semi-empirical methods such as CNDO⁶ then emerged in the 1960s to improve on these one-electron methods, which were effective but narrow in scope.
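To make the Hückel treatment of butadiene concrete, the following sketch evaluates the π orbital energies of a linear conjugated chain. It relies on the textbook result that, for a chain of n carbon atoms, the Hückel energies are E_k = α + 2β cos(kπ/(n+1)) for k = 1 to n, and checks these values against the secular polynomial of the chain. The function names are illustrative assumptions, and the closed form holds only for linear chains, not for rings such as benzene.

```python
import math


def chain_huckel_levels(n_atoms):
    """Coefficients x_k such that E_k = alpha + x_k * beta for a linear chain."""
    return [2.0 * math.cos(k * math.pi / (n_atoms + 1)) for k in range(1, n_atoms + 1)]


def secular_polynomial(n_atoms, x):
    """Characteristic polynomial of the chain adjacency matrix.

    Uses the tridiagonal recurrence P_n = x * P_(n-1) - P_(n-2),
    with P_0 = 1 and P_1 = x. Its roots are the level coefficients x_k.
    """
    previous, current = 1.0, x
    if n_atoms == 0:
        return previous
    for _ in range(n_atoms - 1):
        previous, current = current, x * current - previous
    return current


def pi_energy_beta_coefficient(n_atoms, n_electrons):
    """Beta coefficient of the total pi energy, filling levels pairwise.

    Since beta is negative, the largest x_k is the lowest orbital energy.
    The alpha part of the total energy is simply n_electrons * alpha.
    """
    coefficient = 0.0
    remaining = n_electrons
    for x in sorted(chain_huckel_levels(n_atoms), reverse=True):
        occupancy = min(2, remaining)
        coefficient += occupancy * x
        remaining -= occupancy
        if remaining == 0:
            break
    return coefficient


if __name__ == "__main__":
    # Butadiene: four carbon atoms, four pi electrons.
    print(chain_huckel_levels(4))
    print(pi_energy_beta_coefficient(4, 4))
```

For butadiene this gives a total π energy of 4α + 2√5 β, about 4α + 4.47β, compared with 4α + 4β for two isolated ethylene double bonds; the difference is the Hückel delocalization energy.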

Methodologically, there is one notable difference between computational and theoretical chemistry. Computational chemistry implements fully developed mathematical methods on a computer, producing programs adapted to the desired methodology that can then be applied automatically to solve well-defined chemical problems efficiently. Theoretical chemistry, in contrast, rests on a purely mathematical description of chemistry, expressed through algorithms and computer programs developed by chemists, physicists and mathematicians. It thereby contributes to the prediction of atomic and molecular properties and of reaction paths for chemical reactions. Computational chemistry can be defined as a branch of chemistry that addresses scientific problems by applying concepts of chemistry together with principles of computer science. Theoretical chemistry plays a complementary role: building on its results, computational chemistry can determine the structures and properties of molecules and solids with sufficiently powerful computer programs. Computational chemistry sidesteps the analytical solution of the quantum n-body problem, which cannot be solved in closed form except for the simplest systems such as the hydrogen molecular ion; its application in areas such as the design of new drugs has been remarkably successful. Computational chemistry calculations use methods ranging from highly accurate to very approximate. They allow predictions that confirm or extend the results of chemical experiments and, in some cases, give access to chemical phenomena that were previously unobserved. Each of these methods has particular characteristics related to its principle of operation or its scope.
Thus, highly accurate methods are very appropriate for the study of small systems and associated applications, while ab initio methods apply entirely the fundamental basis of the theory of the first principles. Empirical or semi-empirical methods contribute remarkably to the development of many approximations that help to characterize some elements of the underlying theory. For this purpose, the center of interest lies in the rational exploitation of experimental results obtained from acceptable models of atoms or related molecules. However, these empirical or semi-empirical methods generally belong to the group of less accurate methods. The use of certain approximations remains a fundamental parameter in the process of developing both ab initio and semi-empirical methods. Ab initio methods focus exclusively on the use of the Born-Oppenheimer approximation; this facilitates systematically the simplification of the underlying Schrödinger equation by freezing the nuclei in place during the calculation. At first, reducing the number of approximations has a highly positive effect for ab initio methods, insofar as it entails the absolute convergence of the underlying equation to an exact solution. However, in practice, the prospect of a complete elimination of all approximations is a phenomenon difficult or impossible to achieve, because the residual errors in themselves are highly capable of being present and tend to remain permanently. As one of the main goals, computational chemistry aims to minimize residual errors without affecting systematically the calculations. The determination of molecular structures by an


approach consisting of simulating forces, the use of quantum mechanical methods to determine the points on the energy surface that are stationary with respect to changes in the positions of the nuclei, the effective synthesis of molecular compounds using appropriate computational techniques, the active search through databases and storage of data on chemical entities (chemical databases), the estimation of a direct relationship or correlation between chemical structures and properties (QSPR/QSAR), and the design of molecules capable of undergoing a specific interaction with other molecules through computational approaches are among its major areas.

1.3 Computational Chemistry

On the health front, the world today is subject to new pandemics and diseases that are a real obstacle to the full development of human beings. To deal with this problem, it has become imperative, and a major challenge, to search actively for new inexpensive and readily available medications. A statistical study conducted on new drugs has established that for a sample of 10,000 molecules synthesized and tested, on average, one has the characteristics of an innovative product with commercial properties. Generally, the development cycle of a new drug is a particularly long process that can spread over 10 to 15 years of research. During this development, the major objective is to obtain a molecule that possesses not only particular therapeutic properties but also, and above all, a minimum of unwanted side effects. To conduct these syntheses, which are often carried out unnecessarily over a long period of time, a strong mobilization of human, material and financial means is required.
These factors have an enormous influence on the final product, which is often found on the market at a very high price and therefore not easily accessible for individuals with an average standard of living. To overcome this disadvantage, researchers in the pharmaceutical industry have proposed a new method that consists of predicting in advance the properties and activities of molecules before moving on to the final step, which is the realization of their synthesis in a laboratory. In computational chemistry, two fields of research, Quantitative Structure-Activity Relationships (QSAR) and the Quantitative Structure-Property Relationship (QSPR), whose principal objectives are the identification of commonalities that may exist between molecules in large databases of existing molecules whose properties are known, have been developed to meet this urgent need. The highlighting of such a relationship has many advantages: On the one hand, it contributes to determining the physical and chemical properties and biological activities of compounds. On the other hand, it participates actively in the development of new theories, or it helps to obtain a fair idea about the observed phenomena in order to conduct a comprehensive study of whole families of compounds or to synthesize new molecules without using data obtained from this synthesis in the laboratory. Today, thanks to the advent of molecular modeling, we can establish without much difficulty relationships that may exist between the structures of molecules and their properties or activities. In molecular modeling techniques, the molecules are usually characterized by a set of data consisting of descriptors, which are measures or real numbers derived from calculations performed on the molecular structures. This opens the way for the establishment of a possible relationship that can exist between these descriptors and the modeled properties. However, these methods whose


effectiveness is well established are still facing difficulties that are mostly related to the calculation and selection of relevant descriptors.

1.3.1 Selection and Calculation of Descriptors

In recent decades, the thorny problem about how chemical information extracted from molecular structures, commonly referred to as a set of real numbers or descriptors, could be effectively symbolized was the epicenter of several research works. Once these descriptors are adequately represented, they allow, through traditional modeling techniques, the establishment of a relationship between the chemical information contained in the molecule and a molecular property or activity. These numerical descriptors are responsible for the transmission of information in a vector of real functions. They can be used to perform a quantitative assessment of physico-chemical or structural characteristics of molecules and currently we can evaluate more than 3000 kinds of descriptors. We can determine the different descriptors using either empirical or semi-empirical methods. The use of descriptors obtained without resorting to a transition by experimental methods, namely directly by calculation or prediction, is a highly appropriate technique that allows predictions while bypassing the molecule synthesis step, and this approach is one of the major points of the modeling. In reality, obtaining data through an experimental process seems to be the most accessible way, compared to the prediction of a property or activity such as the logPow7, L, or IEL, which constitute a small set of descriptors that can be determined by measurement. However, these methods for determining the activity or property of a molecule are sometimes faced with the problem of misunderstanding and misinterpretation of the mechanisms. Thus, obtaining first a sufficiently large number of different descriptors would be the most important step for the early stages of modeling, which will be followed by the selection of those that have a considerable influence or that are most relevant for modeling. Generally in modeling, descriptors are divided into three main groups, 1D, 2D, and 3D descriptors.

The 1D descriptors, which describe the atomic composition (number and type of atoms) or the mass composition (molar mass) of a molecule, capture global properties of the molecule. These descriptors are generated directly from the empirical formula of the compound and therefore cannot distinguish between constitutional isomers.
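The limitation of 1D descriptors can be illustrated with a minimal sketch. The function name `descriptors_1d` and the small table of rounded atomic masses are assumptions made for this illustration only; they are not part of the software used in this work.

```python
import re

# Rounded standard atomic masses in g/mol (illustrative subset, assumption).
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}

def descriptors_1d(formula):
    """Return (atom counts, molar mass) for a simple formula such as 'C2H6O'."""
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(number) if number else 1)
    molar_mass = sum(ATOMIC_MASS[s] * n for s, n in counts.items())
    return counts, molar_mass

# Ethanol (CH3-CH2-OH) and dimethyl ether (CH3-O-CH3) share the formula C2H6O,
# so every descriptor derived from the formula alone is identical for both.
ethanol = descriptors_1d("C2H6O")
dimethyl_ether = descriptors_1d("C2H6O")
```

Because both isomers map to the same formula, no model built only on such descriptors can separate them.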

The 2D descriptors, which consist mainly of constitutional indices (number of single and multiple bonds, number of rings, etc.) and topological indices (the Wiener index8, the Randić index9, the Kier-Hall valence connectivity index10, and the Balaban index11), describe the connectivity of the molecule, including its shape, size, and branching. These descriptors are obtained exclusively from the structural formula of the compound. Their major asset is that they characterize the physical properties of the molecule, but they show shortcomings when used to describe properties or activities such as biological activity.
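As an example of a topological index, the Wiener index can be sketched as the sum of shortest-path bond counts over all pairs of heavy atoms. The adjacency-dictionary representation and the function name below are assumptions chosen for this illustration.

```python
from collections import deque

def wiener_index(adjacency):
    """Sum of shortest-path bond counts over all pairs of heavy atoms."""
    total = 0
    for start in adjacency:
        # Breadth-first search gives topological distances from 'start'.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            atom = queue.popleft()
            for neighbour in adjacency[atom]:
                if neighbour not in dist:
                    dist[neighbour] = dist[atom] + 1
                    queue.append(neighbour)
        total += sum(dist.values())
    return total // 2  # every pair was counted twice

# Hydrogen-suppressed graphs of two C4H10 isomers.
n_butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
isobutane = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}
```

Unlike the 1D descriptors, this index differs between the two constitutional isomers because it depends on how the atoms are connected.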

The 3D descriptors, which comprise geometrical descriptors (the molecular volume, the solvent-accessible surface, and the principal moments of inertia), electronic descriptors (dipole moment, ionization potential, and other energies related to the molecule), and even spectroscopic descriptors, allow a fairly broad description of complex characteristics. These descriptors depend entirely on the 3D geometry, in which the atoms constituting the molecule adopt sufficiently stable relative positions, and they can be generated from empirical or ab initio molecular modeling calculations. Some of these descriptors have features that are linked to their methodology or their functionality. Electronic descriptors, for example, are very often derived from quantum chemical calculations, in which microscopic descriptors can characterize molecules by spectroscopic measurements such as vibrational wave functions.

1.4 QSAR/QSPR Modeling

In the late 19th century, Crum-Brown and Fraser12 attempted to model the activity of molecules, which allowed them to highlight the existence of a close relationship between the biological activity of a molecule and its chemical composition. Later, the year 1964 was marked by the advent of the famous “Group Contribution” theory, which became the trigger for the beginning of QSAR modeling. The establishment of new learning techniques for modeling, which were initially linear and subsequently nonlinear, spurred the development of many methods whose focal point was to highlight the relationship that may exist between the molecular descriptors and the properties or activities to be predicted. QSAR, or QSPR, is the process used to establish a quantitative link between a defined chemical structure and a well-defined process, generally a biological activity or chemical reactivity. The fundamental principle of the Structure-Activity Relationship (SAR) is based on the hypothesis that molecules with common features are likely to produce similar biological activities.
However, the difficulty commonly encountered in the applicability of this hypothesis seems to be related to the manner in which any difference existing at the molecular level can be considered as the main parameter on which each type of activity, such as the reaction or biotransformation ability, the solubility, or the target activity, depends. QSARs are an assembly consisting of predictive models, generated beforehand through statistical tools, whose purpose is to establish a certain parallelism between the biological activity (including desirable therapeutic effects and undesirable side effects) of chemicals (drugs, toxicants, or environmental pollutants) and descriptors representative of the molecule and/or its properties. The scope of QSAR models has been expanded and the most significant areas are, among others, risk assessment, the prediction of toxicity, regulatory decisions13, drug discovery and lead optimization14. Generating a good QSAR model and satisfying the set standards is entirely dependent on the choice of biological data, descriptors, and sufficiently appropriated statistical methods. The major goal of any QSAR modeling is to produce statistically robust models, which can be easily used to perform an efficient and reliable prediction of the biological activities of newly discovered compounds. Although obtaining a QSAR model with highly significant characteristics is extremely parameterized by the quality of the input data, the selection of descriptors, and the choice of the statistical approach to be used, its performance is nonetheless related to its validation, which clearly remains the only way by which one can establish the relevance and reliability of a procedure applied in a particular case15. 
The techniques necessary for the determination of training set compounds16, setting the training set size17, and the effect of the distribution of variables according to their importance in the evaluation of training set models, which gives a basic idea of the quality of predictions18, seem to be the main parameters to take into account


during the validation of a QSAR model. Moreover, one of the most important points allowing for the appreciation of the quality of QSAR models is the focus on the development of novel validation parameters19. The prediction of boiling points is one of the first applications that has been conducted with success, and which also has particularly marked the history of QSAR20.
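To make the idea of validation concrete, the following minimal sketch computes a leave-one-out cross-validated q² for a multiple linear regression model. It is a generic illustration, not one of the novel validation parameters cited above; the function name `loo_q2` and the toy data are assumptions.

```python
import numpy as np

def loo_q2(X, y):
    """Leave-one-out cross-validated q2 for a multiple linear regression."""
    y = np.asarray(y, dtype=float)
    # Design matrix with an intercept column followed by the descriptors.
    A = np.column_stack([np.ones(len(y)), np.asarray(X, dtype=float)])
    press = 0.0  # predictive residual sum of squares
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        coef, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
        press += (y[i] - A[i] @ coef) ** 2
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

# Hypothetical single descriptor and a property that depends linearly on it.
descriptor = np.arange(6.0)
target = 2.0 * descriptor + 1.0
q2 = loo_q2(descriptor, target)
```

Each compound is predicted by a model that was fitted without it, so q² reflects predictive ability rather than goodness of fit.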

In organic chemistry, the chemical compounds that share the same functional groups very regularly possess structures with strong correlations to the properties to be predicted. Determining experimentally the biological activity of a molecule permits the evaluation of the degree of inhibition of a sufficiently defined signal transduction or metabolic pathway. In drug discovery, the biological activity of a chemical species is commonly referred to as its toxicity. For this purpose, the chemical molecules whose inhibitory effects exerted on their respective targets have been judged successful and whose degree of toxicity is sufficiently diminished compared to the threshold (non-specific activity) are most appropriate for the application of QSAR techniques, which thereby facilitates their identification. The biological activity, recognized in pharmacology under the name of pharmacological activity, provides a total reflection of all effects, whether desirable or not, that a drug can cause when it is in the presence of living matter. Given the close relationship between the pharmacological activity and the beneficial or adverse effects of drug candidates, it is quite natural that the toxicity of a chemical structure is fully assimilated to the type of biological activity.

The logPow, as stated in Lipinski’s Rule of Five21, plays a significant role in the different QSAR applications that are related to the identification of “drug likeness”. The development of a QSAR/QSPR model follows a process that consists of the following steps:

1. Query experimental data
2. 2D to 3D conversion (Corina/Concord)
3. Quantum chemical calculations (VAMP)
4. Descriptors calculation (ParaSurf)
5. QSAR/QSPR modeling

In 1997, Christopher A. Lipinski, working on lipophilicity, laid down a principle called Lipinski’s Rule of Five, which states that many medications are relatively small molecules that very often belong to the family of lipophilic compounds21. This rule, which has seen some popularity in QSAR/QSPR, shows particular success when applied in the evaluation of drug likeness, that is, the prediction of whether a chemical compound with a certain pharmacological or biological activity is likely to be effective in humans when administered orally. Thus, according to Lipinski, a drug that is active in humans after oral intake should violate no more than one of the following basic criteria:

• A ClogP value not greater than 5 (logP units)22,
• A molecular weight not greater than 500 Dalton,
• A number of hydrogen bond donors not greater than 5 (sum of –OH’s and –NH’s),
• A number of hydrogen bond acceptors not greater than 10 (sum of N and O atoms).
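The four criteria can be checked with a few lines of code. The sketch below assumes that the descriptor values have already been computed elsewhere; the function names and the default of at most one tolerated violation (the common reading of the rule) are assumptions of this illustration.

```python
def lipinski_violations(clogp, mol_weight, h_donors, h_acceptors):
    """Count how many Rule-of-Five criteria a compound violates."""
    checks = [
        clogp > 5,         # ClogP not greater than 5
        mol_weight > 500,  # molecular weight not greater than 500 Dalton
        h_donors > 5,      # hydrogen bond donors not greater than 5
        h_acceptors > 10,  # hydrogen bond acceptors not greater than 10
    ]
    return sum(checks)

def is_drug_like(clogp, mol_weight, h_donors, h_acceptors, max_violations=1):
    """True if the compound violates at most 'max_violations' criteria."""
    n = lipinski_violations(clogp, mol_weight, h_donors, h_acceptors)
    return n <= max_violations

# Hypothetical compound: ClogP 2.0, 300 Da, 2 donors, 4 acceptors.
example_ok = is_drug_like(2.0, 300.0, 2, 4)
```

Keeping the violation count separate from the final decision makes it easy to change the tolerated number of violations.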

In QSPR, several successful methods for determining logP have been developed. Among these methods, we can mention:

• Atomic based prediction, or atomic contribution (AlogP, MlogP, etc.),
• Fragment based prediction, or group contribution (ClogP, etc.),
• Data mining prediction,
• Molecule mining prediction,
• Estimation of logD (at given pH) from logP and pKa.

The distribution coefficient of a molecule, logD23, is the logarithm of the ratio between the summed concentrations of all forms of the compound (ionized and un-ionized) in each of the two phases at a given pH. For non-ionizable compounds, logD is equal to logP at any pH; for ionizable compounds, it approaches logP at pH values where the un-ionized form dominates.
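The estimation of logD from logP and pKa can be sketched for the simplest case. The code below assumes a compound with a single ionizable group and assumes that only the un-ionized form partitions into the organic phase; these simplifications, the function name `log_d` and the default pH are assumptions of this example and not the method of reference 23.

```python
import math

def log_d(log_p, pka=None, ph=7.4, acid=True):
    """Estimate logD at a given pH for a compound with at most one ionizable group."""
    if pka is None:
        # Non-ionizable compound: logD equals logP at any pH.
        return log_p
    # Fraction of ionized species grows with (pH - pKa) for acids
    # and with (pKa - pH) for bases.
    exponent = (ph - pka) if acid else (pka - ph)
    return log_p - math.log10(1.0 + 10.0 ** exponent)

# Hypothetical acid with logP = 3.0 and pKa = 4.0 at pH 7.0.
log_d_acid = log_d(3.0, pka=4.0, ph=7.0)
```

For an acid three pH units above its pKa, the estimate drops by about three log units relative to logP, because almost all of the compound is ionized.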

In molecular modeling, descriptors that symbolically support some of the information contained in the molecule play a fundamental role, and insofar as faithfully relaying this information they can help to predict effectively the property or activity of a molecule. The group contribution technique, which was one of the first methods applied in the early era of QSAR modeling, remains today the main alternative for specific applications such as those involving the characterization of molecules. In 1988, Cramer initiated the development of Comparative Molecular Field Analysis (CoMFA24), a method mainly based on a preliminary alignment of molecules in order to direct them all towards a direction favorable for modeling. This method that totally optimizes some applications seems to be the most plausible alternative to bypass some conventional or classical methods, which up to now show some shortcomings when applied to the determination of biological activity. Generally, the interaction of a molecule (ligand) with the corresponding receptor is the main factor that regulates the biological activity of the molecule. The precise determination of this interaction and the nature of the relationship between it and the activity under investigation are the catalyst for the modeling of the biological activity. Compared to different approximations related to the development of QSAR, the CoMFA method appears to be the most appropriate for applications oriented to the modeling of protein-ligand interactions.


1.5 Objective and Thesis Outline

As part of this project, we have developed a new method for predicting the solvation free energy of small organic compounds in different solvents, based on a combination of the self-consistent reaction field (SCRF) calculations25 and the local energy properties. Through the SCRF routine, we have extended the calculation of the solvent effect to d-orbitals by an implementation of the multipole approach. New logPow models have been developed, using our recent binned SIM, which is based on binned area descriptors, in contrast to our old polynomial SIM models. For these logPow models, several types of application can be envisaged. The thesis is structured as follows:

Chapter 1 is a general introduction about QSAR/QSPR modeling, including the basic concept and its application to biological activity.

Chapter 2 presents the surface effect on the calculation of the “local surface tension” contribution to the solvation free energy at the molecular surface, the techniques used to determine the solvent effect, the solvation models obtained, the order of importance of each local property on these solvation models, and the problems related to their validation and application.

Chapter 3 introduces the concept of binned SIM used for the development of our logPow models. We then show how the prediction of the logPow is related to the field normal to the surface (FN), the flexibility/rigidity of a molecule, the number of hydrogen bond donor/acceptor atoms, and the molecular surface.

Chapter 4 is devoted to applications of the logPow models previously developed to the classification of phospholipidosis-inducing compounds. We present a prediction of the activity of some drugs that may induce the accumulation of phospholipids in the human body. We stress the ability of two ML algorithms (RF and NB) to predict induction of phospholipidosis. We evaluate the effect of the classification approach and the molecular surface used on the prediction quality.


1.6 References

1. Pauling, L.; Wilson, E. B. Introduction to Quantum Mechanics with Applications to Chemistry. McGraw-Hill Education, 1935.

2. Eyring, H.; Walter, J.; Kimball, G. Quantum Chemistry. John Wiley and Sons Inc, 1944.

3. Heitler, W. Elementary Wave Mechanics with Applications to Quantum Chemistry. Oxford: Clarendon Press, 1945.

4. Coulson, C. A. Valence. Oxford: Clarendon Press, 1952.

5. Streitwieser, A.; Brauman, J. I.; Coulson, C. A. Supplementary Tables of Molecular Orbital Calculations. Oxford: Pergamon Press, 1965.

6. Pople, J. A.; Beveridge, D. L. Approximate Molecular Orbital Theory. New York: McGraw Hill, 1970.

7. Hansch, C.; Leo, A.; Hoekman, D. Exploring QSAR: Hydrophobic, Electronic and Steric Constants. American Chemical Society: Washington, DC, 1995.

8. Wiener, H. Structural Determination of Paraffin Boiling Points. Journal of the American Chemical Society 1947, 69, 17-20.

9. Randić, M. On Characterization of Molecular Branching. Journal of the American Chemical Society 1975, 97, 6609-6614.

10. Kier, L. B.; Hall, L. H. Molecular Connectivity in Chemistry and Drug Research. New York: Academic Press, 1976.

11. Balaban, A. T. Highly Discriminating Distance-Based Topological Index. Chemical Physics Letters 1982, 89, 399-404.

12. Crum-Brown, A.; Fraser, T. On the Connection between Chemical Constitution and Physiological Action. Transactions of the Royal Society of Edinburgh 1868-69, 25, 151-203.

13. Tong, W.; Hong, H.; Xie, Q.; Shi, L.; Fang, H.; Perkins, R. Assessing QSAR Limitations: A Regulatory Perspective. Current Computer-Aided Drug Design 2005, 2, 195-205.

14. Dearden, J. C. In Silico Prediction of Drug Toxicity. Journal of Computer-Aided Molecular Design 2003, 17 (2-4), 119-127.

15. Roy, K. On Some Aspects of Validation of Predictive Quantitative Structure-Activity Relationship Models. Expert Opin. Drug. Discov. 2007, 2 (12), 1567-1577.

16. Leonard, J. T.; Roy, K. On Selection of Training and Test Sets for the Development of Predictive QSAR Models. QSAR & Combinatorial Science 2006, 25 (3), 235-251.

17. Roy, P. P.; Leonard, J. T.; Roy, K. Exploring the Impact of Size and Training Sets for the Development of Predictive QSAR Models. Chemometrics and Intelligent Laboratory Systems 2008, 90 (1), 31-42.

18. Roy, P. P.; Roy, K. On Some Aspects of Variable Selection for Partial Least Squares Regression Models. QSAR & Combinatorial Science 2008, 27 (3), 302-313.

19. Roy, P. P.; Paul, S.; Mitra, I.; Roy, K. On Two Novel Parameters for Validation of Predictive QSAR Models. Molecules 2009, 14 (5), 1660-1701.

20. Rouvray, D. H.; Bonchev, D. Chemical Graph Theory: Introduction and Fundamentals. Tunbridge Wells, Kent, England: Abacus Press, 1991.

21. Lipinski, C. A.; Lombardo, F.; Dominy, B. W.; Feeney, P. J. Experimental and Computational Approaches to Estimate Solubility and Permeability in Drug Discovery and Development Settings. Adv. Drug. Del. Rev. 2001, 46, 3-26.

22. Leo, A.; Hansch, C.; Elkins, D. Partition Coefficients and Their Uses. Chem. Rev. 1971, 71 (6), 525-616.

23. Csizmadia, F.; Tsantili, A.; Panderi, I.; Darvas, F. Prediction of Distribution Coefficient from Structure. 1. Estimation Method. Journal of Pharmaceutical Sciences 1997, 86 (7), 865-871.

24. Cramer, R. D.; Patterson, D. E.; Bunce, J. D. Comparative Molecular Field Analysis (CoMFA). 1. Effect of Shape on Binding of Steroids to Carrier Proteins. Journal of the American Chemical Society 1988, 110 (18), 5959.

25. Tomasi, J.; Persico, M. Molecular Interactions in Solution: An Overview of Methods Based on Continuous Distribution of the Solvent. Chem. Rev. 1994, 94, 2027-2097.


Chapter 2

Predicting the Solvation Free Energy using a Combination of Semi-empirical Self-consistent Reaction Field Calculations and the Local Energy Properties

2.1 Introduction

In recent years, the study of quantum chemistry has entered the mainstream for the investigation of organic molecules and the reaction mechanisms that take place in the gas phase. This method, which over time has become increasingly reliable, demonstrated its efficacy when it was used to predict the electronic properties of organic compounds1. Recently it has become apparent that in molecular modeling there is the increasing necessity of considering new paradigms for quantitative structure-activity (QSAR) and structure-property (QSPR) relationships, scoring functions and docking, and other applications in cheminformatics and modelling2. QSAR are related to some regression models that have as a culminating point the harmonious integration of experimental properties with continuous values, such as aqueous solubility, melting point, blood-brain barrier permeability, hydrophobicity, barrier penetration, lethal concentration, or inhibitor constants for enzymes. The major advantage of QSAR models is based on the fact that, relying on new classes of structural descriptors or more powerful statistical models, an extension of the Hansch model, which was already in evidence in molecular modeling, was made possible3,4. Nowadays, the QSAR models that are developed all tend to focus on the study of the intrinsic relationships that may exist between the structures of chemical compounds represented as chemical networks5-8. Through their early work on estimating the physical properties of molecular structures (QSPRs), Hansch and Leo9,10 indeed laid the foundations for computational chemistry. Today, molecular modeling is strongly influenced by generating models based on the octanol-water partition coefficient (logPow11-13), standard enthalpies of formation14-16, boiling points17, melting points18, and aqueous solubility19,20. 
These models are based on incremental approaches that involve splitting the molecule into atoms or groups, each of which is assigned an additive contribution.

The aim of this work is to develop some new QSPR models for determining the free energy of solvation (ΔGsolv) of organic compounds in different solvents, based on the self-consistent reaction field (SCRF) calculation of the solvent effect and the local energy properties. To address this problem, the first stage of development was to implement the multipole model developed by Clark et al.21 into the SCRF routine22, so that the solvent effects for compounds containing either s,p-orbitals or s,p,d-orbitals could be easily calculated.

In any solution, aqueous or not, there is an interaction between the solute and the solvent that depends on the geometry of the solute. Thus, to calculate ΔGsolv, one must first determine an average conformation for the solute. This problem can be solved by doing a single calculation using either the optimized gas-phase geometry or the optimized liquid-phase geometry23,24. In the SCRF model25-27, the reaction field is integrated directly into the Hamiltonian, so that the two are treated together. Thanks to this feature, SCRF has become the most appropriate model for both semi-empirical and ab initio molecular orbital treatments. Within the SCRF framework, one can distinguish between cavity models and models that do not require the definition of a boundary separating the solute from the solvent. The first attempts at a detailed description of the chemical environment were made by Klopman28, Germer29 and Miertus30. Based on Born’s31 formalism, they were able to generate solvation models appropriate for this description. This approach highlights the effect of the solvent on the solute, which is summarized in a set consisting of the negative of the


Mulliken’s net charge32 for the atomic center in connection with the solvation. Truhlar et al.33 proceeded by a dependent SCRF approach, which comprises forming a single unit resulting from the inclusion of the solvent effect in the Hamiltonian. Tomasi34 and Olivares del Valle35 by an approach based on the use of an arbitrarily shaped boundary outlined the concept of continuous models. Their main characteristics are their application to the ab initio formalisms and their ability to predict accurately the solvent effects. Solvent effects can also be calculated with several other techniques, such as the Generalized Born model (GB)36-41, the Poisson-Boltzmann (PB)42,43 model, and the Conductor-like model (COSMO)44-46.

In the SCRF approach22 used for this work, the algorithms of Pascual-Ahuir47 and Connolly48 were adapted to define an arbitrarily shaped cavity. In parallel, the approach of Marsili49 was modified using a marching-cube algorithm to generate the surface points. In SCRF theory, the molecular electrostatic potential (MEP) allows a qualitative definition of the nature of the reaction field as well as of the electrostatic interaction free energy. The MEP has a major impact in some areas of computational chemistry, such as drug design50, the simulation of intermolecular interactions51, molecular similarity studies52, and continuum solvent models within molecular orbital (MO) theories51. Because of its use in continuum models, calculating the solvent effects through the SCRF approach requires accurate methods that calculate the MEP quickly at or near the van der Waals surface.

Thus, the MEP53 seems to be the main catalyst in applications closely related to intermolecular interaction energies, such as QSAR54, QSPR55, prediction of toxicity56, docking57, (continuum) solvation models58-60 and many others. Using results derived from the quantum mechanical calculation, the MEP (V(r)) can be obtained by the formula

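\[
V(\mathbf{r}) = \sum_{i=1}^{n} \frac{Z_i}{\left|\mathbf{R}_i - \mathbf{r}\right|} - \int_{-\infty}^{\infty} \frac{\rho(\mathbf{r}')}{\left|\mathbf{r}' - \mathbf{r}\right|}\, d\mathbf{r}' . \tag{2.1}
\]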

Here V(r) represents the electrostatic potential at any point r, n is the number of atoms in the molecule, Z_i is the nuclear charge of atom i located at R_i, and ρ(r') is the electronic density function of the molecule.
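When the charge distribution is approximated by a set of point charges, the potential reduces to a Coulomb sum. The following minimal sketch evaluates such a sum in atomic units; the function name and the toy charges are assumptions, and the sketch is not the NAO-PC or multipole scheme itself.

```python
import numpy as np

def mep_point_charges(charges, positions, r):
    """Electrostatic potential (atomic units) at point r from point charges."""
    charges = np.asarray(charges, dtype=float)
    positions = np.asarray(positions, dtype=float)
    # Distance from every charge to the evaluation point.
    distances = np.linalg.norm(positions - np.asarray(r, dtype=float), axis=1)
    return float(np.sum(charges / distances))

# Two hypothetical point charges on the z axis, potential evaluated at z = 2.
v = mep_point_charges([1.0, -1.0], [[0.0, 0.0, 0.0], [0.0, 0.0, 1.0]], [0.0, 0.0, 2.0])
```

The evaluation point must not coincide with a charge position, since the Coulomb term diverges there.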

The point charges used for calculating the MEP are calculated using either the natural atomic orbital point charge (NAO-PC)61-64 for s,p-orbitals or the multipole model21 for s,p,d-orbitals. With the multipole approach, the MEP is obtained using the equation

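\[
V(\mathbf{r}) = q\,\frac{1}{R} - \hat{\mu}_{\alpha}\,\nabla_{\alpha}\frac{1}{R} + \frac{1}{3}\,\hat{\Theta}_{\alpha\beta}\,\nabla_{\alpha}\nabla_{\beta}\frac{1}{R} \tag{2.2}
\]

In this case, q, \hat{\mu}_{\alpha} and \hat{\Theta}_{\alpha\beta} are the operators for monopole, dipole and quadrupole, respectively. R is the distance between the multipole center and the MEP point, and ∇ is the nabla operator.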
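A multipole expansion of this kind can be sketched numerically as follows. The code assumes the standard convention in which R points from the multipole center to the MEP point, the gradients act on the MEP point, the quadrupole tensor is traceless, and atomic units are used; the function name and these conventions are assumptions of this example rather than details taken from the implementation used in this work.

```python
import numpy as np

def multipole_potential(q, mu, theta, center, r):
    """Potential at r from a monopole q, dipole mu and traceless quadrupole theta."""
    R_vec = np.asarray(r, dtype=float) - np.asarray(center, dtype=float)
    R = np.linalg.norm(R_vec)
    # First and second derivatives of 1/R with respect to the MEP point.
    grad = -R_vec / R**3
    hess = (3.0 * np.outer(R_vec, R_vec) - R**2 * np.eye(3)) / R**5
    mu = np.asarray(mu, dtype=float)
    theta = np.asarray(theta, dtype=float)
    return float(q / R - mu @ grad + np.sum(theta * hess) / 3.0)

# Dipole of two charges +1 and -1 at z = +0.1 and z = -0.1 (mu_z = 0.2).
v_dipole = multipole_potential(0.0, [0.0, 0.0, 0.2], np.zeros((3, 3)),
                               [0.0, 0.0, 0.0], [0.0, 0.0, 10.0])
v_exact = 1.0 / 9.9 - 1.0 / 10.1
```

Far from the charges, the truncated expansion agrees closely with the exact Coulomb sum, which is what makes it attractive for fast MEP evaluation at the molecular surface.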


Up to now, different models have been proposed for calculating ΔGsolv. In 1975, Hine and Mookerjee65 suggested an approach based on the fragment additivity hypothesis. In 1986, Eisenberg and McLachlan, focusing on the solvent-accessible surface area, described another method66. In 1997, Hawkins, Cramer and Truhlar67 made available a model, based on geometry-dependent atomic surface tensions combined with implicit electrostatics, for determining ΔGsolv exclusively in water. In 2005, Clark et al.68 developed an approach based entirely on the local properties of the molecule at the molecular surface for determining ΔGsolv. In the solvation model presented here, ΔGsolv is represented as the sum of the electrostatic energy (ΔGelec; evaluated with the SCRF technique) and the free energy at the molecular surface, namely the local surface tension contribution to ΔGsolv at the molecular surface (ΔGsurf; determined using the surface-integral model (SIM) technique).

2.1.1 Surface-Integral Models (SIMs)

In the SIM approach, a physical property is estimated by integrating one or more local properties over the molecular surface, which can be either an isodensity69 (iso) or a spherical-harmonic surface70 (sphh). Surface-integral models are expressed as

$$P = \sum_{i=1}^{n_{tri}} f\left(V^{i}, IE_{L}^{i}, EA_{L}^{i}, \alpha_{L}^{i}, \eta_{L}^{i}\right)\cdot A^{i} . \tag{2.3}$$
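As a rough illustration of equation 2.3, the following Python sketch sums a function of the five local properties (V, IEL, EAL, αL, ηL) over the surface triangles, weighting each value by the triangle area. The arrays are assumed to hold one value per triangle, and `example_f` with its coefficients is a hypothetical placeholder, not the fitted polynomial of this work.

```python
import numpy as np

def surface_integral(f, V, IE_L, EA_L, alpha_L, eta_L, areas):
    """Sum f(local properties) * triangle area over all surface triangles."""
    values = f(V, IE_L, EA_L, alpha_L, eta_L)   # one value per triangle
    return float(np.sum(values * areas))

def example_f(V, IE_L, EA_L, alpha_L, eta_L):
    # Hypothetical polynomial; the coefficients are made up for illustration.
    return 0.1 * V + 0.01 * V * alpha_L - 0.05 * eta_L

# Two-triangle toy surface
V = np.array([1.0, 2.0])
zeros = np.zeros(2)
areas = np.array([0.5, 0.25])
print(surface_integral(example_f, V, zeros, zeros, zeros, zeros, areas))
```

Because the regression is linear in its coefficients, each polynomial component can be integrated over the surface once and then reused as a descriptor in the fit.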

P is the target property and f is a polynomial function of the five local properties: the electrostatic potential (V), the local ionization energy (IEL), the local electron affinity (EAL), the local polarizability (αL), and the local hardness (ηL). The summation runs over all ntri triangles that constitute the molecular surface. The superscript i refers to the value of the local property evaluated at the center of surface triangle i, which has area Ai. The polynomial function is determined through a multiple linear regression using pre-calculated sums of the individual components of the functions listed in Table A1 of the Appendix.

2.1.1.1 Isodensity surface

An isodensity surface (iso)69 is the set of points in space that share a common value of the electron density ρ(r). ρ(r) is a measure of the probability of finding an electron at a given position. In quantum chemistry, ρ(r) is defined as a function of the space coordinates r such that ρ(r)dr gives the number of electrons in a small volume element dr. For closed-shell molecules, ρ(r) can be expressed as a sum of products of the basis functions φ:


$$\rho(\mathbf{r}) = \sum_{\mu}\sum_{\nu} P_{\mu\nu}\,\phi_{\mu}(\mathbf{r})\,\phi_{\nu}(\mathbf{r}) . \tag{2.4}$$

Here P represents the corresponding density matrix.

2.1.1.2 Spherical-harmonic surface

Spherical-harmonic surfaces (sphh)70 are obtained by fitting to spherical-harmonic expansions, which are based on the following spherical-harmonic function

$$Y_{l}^{m}(\theta,\phi) = \sqrt{\frac{(2l+1)\,(l-m)!}{4\pi\,(l+m)!}}\; P_{l}^{m}(\cos\theta)\, e^{im\phi} . \tag{2.5}$$

In the above function, the two integers m and l are quantum numbers that determine the number and the spatial arrangement of the nodes in each function70. For a given l, m takes the values m = -l, -l+1, …, 0, …, l. The $P_{l}^{m}(\cos\theta)$ are the associated Legendre functions71,72. θ is the angle formed with the direction of the equatorial plane, and φ represents the angle obtained when the reference is taken from any direction chosen inside the plane.

2.2 Methods

NAO-PC61-64 models are limited to s and p orbitals. The development of the multipole model in VAMP73 (the basis of ParaSurf74) for calculating the solvent effect in the SCRF was therefore a main goal. An atomic multipole model (up to quadrupole) for calculating the electrostatic properties of molecules, based on electron densities derived from MNDO-like NDDO-based semi-empirical molecular orbital (MO) calculations with minimal s,p,d valence basis sets, was implemented in the SCRF routine of VAMP 9.0.73 Structures obtained from the literature were converted from 2D structures to 3D MDL SD files using Molecular Networks’ CORINA.75,76 Geometries were optimized in the gas phase using the AM1, AM1*, MNDO/d, or PM3 Hamiltonians with VAMP 9.0.73 Solvent effects were calculated by default using the SCRF for the ground and excited states, and ΔGelec was determined by summing the energies of interaction between the solute and the solvent obtained from the SCRF calculations. The local surface properties were calculated using ParaSurf0974 for either an iso69 or a sphh70, generated through a marching-cube77 or a shrink-wrap algorithm78, respectively. The MEPs were calculated using either the NAO-PC61-64 technique for s,p-orbitals or the multipole model for s,p,d-orbitals. Multiple linear regression analyses with leave-one-out cross-validation were performed with Tsar 3.3.79 With this approach, the predictive R2cv values should be close to the corresponding R2 values. An isodensity value of 0.008 e- Å-3, which is roughly equivalent to a van der Waals surface, was used with the marching-cube algorithm to obtain the iso69 necessary to generate the surface-integral models. Spherical-harmonic expansions were fitted at an isodensity value of 0.0003 e- Å-3, which corresponds to the default van der Waals surface in ParaSurf74 for a spherical-harmonic fit, to obtain the sphh70 necessary to generate regression models derived from the shrink-wrap method. The statistical performance of the models is expressed by the correlation coefficient R, the coefficient of determination R2, the leave-one-out cross-validated coefficient R2cv, the mean signed error (MSE), the mean unsigned error (MUE), and the root-mean-square deviation between experiment and prediction (RMSD). These are presented below in the plots of experimental and predicted values of the physical properties and in the summary tables.

2.3 Results

2.3.1 Free Energy of Solvation

The data sets used for the free energies of hydration (385 species, including 12 anions and 11 cations), the free energies of solvation in octanol (168 neutral compounds), and the free energies of solvation in chloroform (87 neutral compounds) were obtained from the University of Minnesota database80 and are presented in Tables A2, A4, and A5 of the Appendix, respectively. ΔGsolv is obtained from the equation

$$\Delta G_{solv} = \Delta G_{surf} + \Delta G_{elec} . \tag{2.6}$$

The electrostatic effect arises from the interaction between the solvent and the solute. It can be summarized as an electric polarization of the solvent by the polar or non-uniform charge distribution of the solute, and it can also manifest itself as a distortion of the solute by the polarized solvent.

2.3.1.1 Local Properties

Quantum mechanical properties are used systematically to calculate the descriptors, which are statistical variables describing the distribution of V, IEL, EAL, αL, ηL, and the local electronegativity (χL). The local properties calculated at the surface of doxycycline are shown in Figure 2.1.


Figure 2.1. Local property surfaces for doxycycline calculated with ParaSurf. Panels: molecular electrostatic potential, ionization energy, electron affinity, molecular polarizability, hardness, and electronegativity.


2.3.1.2 Free Energy of Solvation in Water

A total data set80 of 385 compounds was used to produce a series of eight models for the solvation free energy in water (ΔGsolv(H2O)), using the AM1, AM1*, MNDO/d, and PM3 Hamiltonians for calculating ΔGelec. The local properties were determined on either the iso69 or the sphh70. The performance of the models is listed in Table 2.1, which provides the statistics obtained for ΔGsolv(H2O) for the entire data set.

Table 2.1. Statistics for the eight models generated with the entire data set for ΔGsolv(H2O) (training set; MUE and RMSD in kcal mol-1)

Model            MUE    RMSD   R2     R2cv
AM1 (iso)        0.86   1.16   0.99   0.81
AM1 (sphh)       1.04   1.31   0.99   0.82
AM1* (iso)       1.54   2.06   0.98   0.75
AM1* (sphh)      1.49   1.98   0.98   0.71
MNDO/d (iso)     1.62   2.16   0.99   0.68
MNDO/d (sphh)    1.71   2.25   0.98   0.67
PM3 (iso)        1.14   1.47   0.99   0.77
PM3 (sphh)       1.24   1.58   0.99   0.72

For AM1 and PM3, using the sphh70 reduces the predictive power, with an increase in MUE and RMSD of ≈ 0.10 to 0.20 kcal mol-1. For AM1* and MNDO/d, there is no significant change in the predictive power for ΔGsolv(H2O) between the iso69 and the sphh70. One of the most significant observations here is the high RMSD values, which range from 1.16 to 2.25 kcal mol-1 (i.e., more than one ΔGsolv unit), with an average of 1.75 kcal mol-1. Paradoxically, despite these high RMSD values, the correlation is fairly strong, with R2 values ranging from 0.98 to 0.99. This implies that these models have a real fitting problem, which could be caused by the data quality, by one of the local properties used, or by the molecular surface. To get a clearer idea of the role played by each of these factors, a histogram (Figure 2.2) of the RMSD obtained for the different Hamiltonians and surfaces was constructed.
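The range and the average quoted above follow directly from the RMSD column of Table 2.1. A minimal Python check, with the values copied from the table (the variable names are only illustrative):

```python
# RMSD values (kcal mol-1) copied from Table 2.1 (entire data set, N = 385)
rmsd_table_2_1 = {
    "AM1 (iso)": 1.16,
    "AM1 (sphh)": 1.31,
    "AM1* (iso)": 2.06,
    "AM1* (sphh)": 1.98,
    "MNDO/d (iso)": 2.16,
    "MNDO/d (sphh)": 2.25,
    "PM3 (iso)": 1.47,
    "PM3 (sphh)": 1.58,
}

values = list(rmsd_table_2_1.values())
lowest = min(values)
highest = max(values)
mean_rmsd = sum(values) / len(values)

print(f"range: {lowest:.2f} to {highest:.2f} kcal mol-1, mean: {mean_rmsd:.2f} kcal mol-1")
```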


Figure 2.2. Hamiltonians and surfaces versus RMSD.

The histogram shows that, in contrast to the AM1 and PM3 Hamiltonians, the RMSD is markedly higher for AM1* and MNDO/d with both the iso69 and the sphh70. This is probably due to the lack of αL, which is not implemented in ParaSurf74 for these Hamiltonians. The histogram thus reveals only the fundamental role played by αL in the models; it gives no accurate information about the data quality or the molecular surface. For this reason, the experimental values of ΔGsolv(H2O) were plotted against the predicted values (Figure 2.3).

Figure 2.3. Experimental and calculated ΔGsolv(H2O) using the iso and the AM1 Hamiltonian for the entire training data set. N = 385, MSE = 0.00, MUE = 0.86, RMSD = 1.16, R2 = 0.99, R2cv = 0.81.


The experimental and calculated values are listed in Table A2 of the Appendix. Figure 2.3 clearly shows the role played by each type of molecule (neutral and ionic).

The sizeable gap between the ΔGsolv(H2O) values of ions and those of neutral molecules leads to the formation of two clusters: one consisting of the neutral molecules, with points close to each other, and the other, well separated from the first, formed by the ionic molecules. This can be attributed to the fact that the ΔGsolv(H2O) values of ions are strongly influenced by large electrostatic contributions. Because the two-cluster data structure shown in Figure 2.3 is poorly suited to linear regression, all compounds without permanent charges were selected from the data set80 and used to generate other models for ΔGsolv(H2O). This subset consists of 362 compounds, all in their neutral forms.

The local contribution to ΔGsolv(H2O) was determined using equation 2.7, which was obtained by performing a multiple linear regression of the five local properties calculated with ParaSurf74 for the iso69.

f(ΔGsurf(H2O, neutral)) = [21-term polynomial in V(r), IEL(r), EAL(r), αL(r), and ηL(r)] - 0.049879    (2.7)

The above equation, obtained with the AM1 Hamiltonian, has the form ax+by+c and contains 21 terms. V, αL, and ηL each play a direct role in the model. V appears in 16 of the 21 terms, confirming the significant role of V in the prediction of intermolecular interaction energies. EAL, αL, IEL, and ηL each appear in eight terms.

To assess the risk of overtraining the models, the free energy data for the neutral compounds were randomized, and 75% of the data were used to construct another model using the same procedure as above. The resulting regression equation consists of 18 terms and gave R2 = 0.94 and R2cv = 0.86. The equation of the model obtained from all 362 neutral compounds contains 21 terms, with R2 = 0.92 and R2cv = 0.85. Another remarkable fact is that only five real variables are ultimately used68, so over-fitting should not be a problem.
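The comparison between the fitted R2 and the leave-one-out R2cv used in this overtraining check can be sketched in Python with NumPy. This is a generic illustration, not the procedure implemented in Tsar: R2cv is assumed here to be 1 - PRESS/SStot, the descriptor matrix `X` is assumed to hold one pre-calculated surface-integral term per column, and the synthetic data are invented for the example.

```python
import numpy as np

def fit_mlr(X, y):
    """Ordinary least-squares fit with an intercept; returns the coefficients."""
    A = np.column_stack([X, np.ones(len(y))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict(coef, X):
    return np.column_stack([X, np.ones(len(X))]) @ coef

def r_squared(y, y_pred):
    ss_res = np.sum((y - y_pred) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

def loo_r2cv(X, y):
    """Leave-one-out cross-validated R2: each point is predicted by a model
    fitted to all other points."""
    n = len(y)
    preds = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        coef = fit_mlr(X[keep], y[keep])
        preds[i] = predict(coef, X[i:i + 1])[0]
    return r_squared(y, preds)

# Synthetic example: two descriptors, a linear target, and some noise
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 3.0 + rng.normal(scale=0.3, size=40)

r2_fit = r_squared(y, predict(fit_mlr(X, y), X))
r2_cv = loo_r2cv(X, y)
print(f"R2 = {r2_fit:.3f}, R2cv = {r2_cv:.3f}")
```

A large gap between the two values would point to overtraining; for an ordinary least-squares fit, R2cv defined this way can never exceed the fitted R2.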

The predicted values of ΔGsolv(H2O) for the neutral compounds are plotted against the experimental values in Figure 2.4.

Figure 2.4. Best model of ΔGsolv(H2O) for the neutral compounds obtained with the iso and the AM1 Hamiltonian. N = 362, MSE = 0.00, MUE = 0.67, RMSD = 0.87, R2 = 0.92, R2cv = 0.85.

There is no strong outlier, which suggests that the model is robust. Moreover, in contrast to the model obtained with the total data set, this data set appears to be well suited to linear regression.

Table A3 of the Appendix contains the experimental and predicted values.

Table 2.2 contains the statistical values for ΔGsolv(H2O) for the neutral compounds.


Table 2.2. Statistical significances of all models obtained with the neutral compounds for ΔGsolv(H2O) (training set; MUE and RMSD in kcal mol-1)

Model            MUE    RMSD   R2     R2cv
AM1 (iso)        0.67   0.87   0.92   0.85
AM1 (sphh)       0.72   0.97   0.90   0.78
AM1* (iso)       0.84   1.10   0.87   0.78
AM1* (sphh)      1.08   1.39   0.80   0.62
MNDO/d (iso)     0.85   1.10   0.87   0.81
MNDO/d (sphh)    1.12   1.44   0.79   0.67
PM3 (iso)        0.72   0.95   0.91   0.85
PM3 (sphh)       0.81   1.10   0.88   0.79

AM1* and MNDO/d combined with the sphh70 show reduced predictive power: relative to the iso69, the RMSD increases by ≈ 0.30 kcal mol-1. For AM1 and PM3, the change in RMSD is smaller, about 0.10 kcal mol-1, when the iso69 is replaced by the sphh70. The average RMSD over all the models is 1.12 kcal mol-1 for the neutral compounds, compared with 1.75 kcal mol-1 for the combined ionic and neutral data set. Removing the ionic compounds from the data set thus lowers the average RMSD by ≈ 0.60 kcal mol-1.
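The surface effect described above can be read off Table 2.2 with a few lines of Python. The RMSD pairs are copied from the table, the value 1.75 kcal mol-1 is the average quoted for the full data set, and the variable names are only illustrative.

```python
# (iso, sphh) RMSD pairs in kcal mol-1, copied from Table 2.2 (neutral compounds)
rmsd_neutral = {
    "AM1": (0.87, 0.97),
    "AM1*": (1.10, 1.39),
    "MNDO/d": (1.10, 1.44),
    "PM3": (0.95, 1.10),
}

# Change in RMSD when the isodensity surface is replaced by the spherical-harmonic one
surface_shift = {h: sphh - iso for h, (iso, sphh) in rmsd_neutral.items()}

all_values = [v for pair in rmsd_neutral.values() for v in pair]
mean_neutral = sum(all_values) / len(all_values)
mean_all = 1.75                      # average RMSD quoted for ionic + neutral compounds
reduction = mean_all - mean_neutral

for hamiltonian, shift in surface_shift.items():
    print(f"{hamiltonian}: {shift:+.2f} kcal mol-1")
print(f"mean RMSD (neutral) = {mean_neutral:.3f}, reduction = {reduction:.3f} kcal mol-1")
```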

The effect of the molecular surface on the models is shown in the histogram below (Figure 2.5) for the models obtained with all molecules (ionic + neutral) and with the neutral molecules selected from the total data set80.

Figure 2.5. Hamiltonians and surfaces versus RMSD for ionic + neutral compounds and the neutral compounds exclusively.

According to the histogram, the highest RMSD values are obtained for the models generated with the ionic and neutral compounds together. For the neutral compounds, the RMSD always increases with the sphh70. The neutral compounds therefore provide more information about the dependence of the solvation models in water on the molecular surface: with the sphh70, the RMSD always increases, and the predictive power decreases accordingly. Solvation models for water generated with the neutral compounds are clearly surface-dependent, in contrast to the models obtained with both ionic and neutral compounds.

To check the behavior of this new approach with solvents other than water, models of the free energy of solvation in octanol (ΔGsolv(octanol)) and in chloroform (ΔGsolv(CHCl3)) were generated.

2.3.1.3 Free Energy of Solvation in Octanol

Models for ΔGsolv(octanol) were developed using the 168 compounds from the total data set80 whose experimental values of ΔGsolv(octanol) are known. The regression equation based on the raw data for the local contribution to ΔGsolv(octanol) at the molecular surface using the iso69 is shown in equation 2.8 for the AM1 Hamiltonian.

f(ΔGsurf(octanol)) = [15-term polynomial in V(r), IEL(r), EAL(r), αL(r), and ηL(r)] + 0.63929    (2.8)

This equation contains 15 terms. V appears in 12 of them and is the dominant variable, as for ΔGsolv(H2O). αL, EAL, IEL, and ηL appear in eight, six, five, and five terms, respectively. As in the solvation model in water for the neutral compounds, V, αL, and ηL play direct roles in this model.

Figure 2.6 shows the performance of the model developed for ΔGsolv(octanol) using the AM1 Hamiltonian and the iso69.

Figure 2.6. Schematic view of the model of ΔGsolv(octanol) performed with the iso and the AM1 Hamiltonian. N = 168, MSE = 0.00, MUE = 0.57, RMSD = 0.73, R2 = 0.92, R2cv = 0.84.

All the points are close to each other, and none of them deviates significantly from the straight line y = ax (a = 1). This implies that there is a fairly good correlation between the calculated and the experimental values.

In Table A4 of the Appendix the experimental and predicted values are presented.

Table 2.3 gives the statistics for ΔGsolv(octanol) for the 168 compounds.

Table 2.3. Measures of performance of all models generated for ΔGsolv(octanol) (training set; MUE and RMSD in kcal mol-1)

Model            MUE    RMSD   R2     R2cv
AM1 (iso)        0.57   0.73   0.92   0.84
AM1 (sphh)       0.66   0.91   0.88   0.70
AM1* (iso)       0.89   1.16   0.80   0.61
AM1* (sphh)      0.85   1.17   0.79   0.58
MNDO/d (iso)     0.81   1.09   0.82   0.74
MNDO/d (sphh)    0.87   1.15   0.80   0.60
PM3 (iso)        0.56   0.74   0.92   0.85
PM3 (sphh)       0.64   0.88   0.88   0.79

All the RMSD values are below one ΔGsolv unit (1 kcal mol-1) for the AM1 and PM3 Hamiltonians; moreover, using the sphh70 reduces the predictive power, increasing the RMSD by ≈ 0.20 kcal mol-1 compared to the iso69. In contrast, all the RMSD values are above 1 kcal mol-1 for AM1* and MNDO/d, and no significant change is observed in the predictive power for ΔGsolv(octanol) when using the iso69 or the sphh70. High RMSD values are obtained with AM1* and MNDO/d for both surfaces.

2.3.1.4 Free Energy of Solvation in Chloroform

Surface-integral models for ΔGsolv(CHCl3) at the molecular surface were built using 87 compounds extracted from the total data set80. The equation obtained by performing a multiple linear regression with Tsar 3.379 on data containing pre-calculated local properties is given below (equation 2.9) for the AM1 Hamiltonian and the iso69.

f(ΔGsurf(CHCl3)) = [eight-term polynomial in V(r), IEL(r), EAL(r), αL(r), and ηL(r)] + 0.47847    (2.9)

The above equation contains eight terms, five of which contain V. αL, IEL, EAL, and ηL appear in four, three, two, and two terms, respectively. In this case, V, αL, and IEL play a direct role in the model.

The predicted values of ΔGsolv(CHCl3) are given in Figure 2.7 as a function of the experimental values.

Figure 2.7. Graphical representation of ΔGsolv(CHCl3) for AM1 Hamiltonian and the iso. N = 87, MSE = 0.00, MUE = 0.46, RMSD = 0.62, R2 = 0.91, R2cv = 0.74.

In this case, too, the experimental and the predicted values of ΔGsolv(CHCl3) are clearly correlated.

In Table A5 of the Appendix, the respective experimental and calculated values are presented.

Table 2.4 gives the performances of ΔGsolv(CHCl3) for the 87 compounds.

Table 2.4. Performances of all the solvation free energy models in chloroform (training set; MUE and RMSD in kcal mol-1)

Model            MUE    RMSD   R2     R2cv
AM1 (iso)        0.46   0.62   0.91   0.74
AM1 (sphh)       0.59   0.78   0.86   0.70
AM1* (iso)       0.64   0.84   0.84   0.63
AM1* (sphh)      0.70   0.92   0.81   0.60
MNDO/d (iso)     0.49   0.67   0.90   0.81
MNDO/d (sphh)    0.72   0.95   0.80   0.70
PM3 (iso)        0.58   0.83   0.85   0.71
PM3 (sphh)       0.54   0.80   0.86   0.67

Except for AM1* and MNDO/d with the sphh70, where the RMSD values are close to 1 kcal mol-1, all the RMSD values are well below one ΔGsolv unit. Using the sphh70 reduces the predictive power for AM1 and MNDO/d, for which the RMSD increases by ≈ 0.20 kcal mol-1 (AM1) and ≈ 0.30 kcal mol-1 (MNDO/d) compared to the iso69. For AM1* and PM3, which are weakly affected, there is no significant change in the predictive power for ΔGsolv(CHCl3) when using the iso69 or the sphh70.

The histogram below (Figure 2.8) gives the variable importance for ΔGsolv(H2O), ΔGsolv(octanol), and ΔGsolv(CHCl3), obtained with the AM1 Hamiltonian and the iso69.

Figure 2.8. Summary of the occurrence of each local property in the solvation models generated with AM1 and the iso.

It appears that, for the same Hamiltonian and surface, the models are strongly dominated by the contribution of V. For QSPR models, V seems to be the most important parameter (i.e., most of the useful information is carried by the MEP terms), but the other local properties are also required.

The solvation models developed above were validated by calculating different partition coefficients.

2.3.2 The Partition Coefficient: logP

The partitioning of an organic solute between two phases, one aqueous and the other organic, is an important parameter on which many phenomena of biological and medicinal chemistry, in particular drug delivery, binding, and clearance, strongly depend. A reliable interpretation of the solvation of organic solutes in organic media would significantly impact conformational analysis in the condensed phase and the ability to predict molecular aggregation. The models obtained for ΔGsolv(H2O), ΔGsolv(octanol), and ΔGsolv(CHCl3) were validated by calculating the octanol-water and chloroform-water partition coefficients logP as

$$\log P_{solvent/H_2O} = -\frac{\Delta G_{solv}(solvent) - \Delta G_{solv}(H_2O)}{2.303\,RT}, \tag{2.10}$$

where R is the gas constant, T equals 298 K, and ΔGsolv(solvent) - ΔGsolv(H2O) is the transfer free energy of a given solute from water to the specified solvent.

2.3.2.1 The Octanol-Water Partition Coefficient: logPow

Lipophilicity is a molecular property that plays an active role in the transport of bioactive molecules to their corresponding receptors12 and in the environmental fate of organic molecules81. Lipophilicity is usually approximated by the logarithm of the partition coefficient, logP, of a compound determined in the octanol/water system, and it is a determining factor in QSAR studies82.

2.3.2.1.1 LogPow for Small Molecules

The data set used for calculating the logPow was obtained from the literature83. It consists of 157 small molecules, listed in Table A6 of the Appendix, which were not included in the data set used to fit the models.

The theoretical logPow obtained using the iso69 and the AM1 Hamiltonian is shown graphically in Figure 2.9.

Figure 2.9. Relationship between the theoretical and the experimental logPow for the models generated with the AM1 Hamiltonian and the iso. N = 157, MSE = -0.09, MUE = 0.46, RMSD = 0.59, R2 = 0.92.

As shown in Figure 2.9, the models for ΔGsolv(H2O) and ΔGsolv(octanol) have good predictive power when applied to a data set of small compounds.
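The conversion from two solvation free energies to a theoretical logP (equation 2.10) is simple enough to sketch in Python. The function name, the value used for the gas constant in kcal mol-1 K-1, and the example free energies are assumptions made for illustration; they are not taken from the data sets of this work.

```python
R_KCAL = 1.987204e-3  # gas constant in kcal mol-1 K-1 (assumed unit system)

def log_p(dg_solvent, dg_water, temperature=298.0):
    """logP(solvent/water) from solvation free energies in kcal mol-1."""
    transfer_free_energy = dg_solvent - dg_water   # water -> solvent
    return -transfer_free_energy / (2.303 * R_KCAL * temperature)

# Hypothetical solute: more favorably solvated in octanol than in water
print(round(log_p(dg_solvent=-5.0, dg_water=-3.0), 2))
```

At 298 K the denominator is about 1.36 kcal mol-1, so an error of 1 kcal mol-1 in the transfer free energy shifts the predicted logP by roughly 0.7 units.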

In Table A6 of the Appendix, the experimental and calculated values of logPow are summarized.

In Table 2.5 the statistics for logPow for the 157 compounds are summarized.

Table 2.5. Performances with a logPow validation (validation set)

Model            MSE     MUE    RMSD   R2
AM1 (iso)        -0.09   0.46   0.59   0.92
AM1 (sphh)       -0.04   0.50   0.65   0.90
AM1* (iso)       -0.05   0.61   0.79   0.86
AM1* (sphh)      0.26    0.59   0.84   0.86
MNDO/d (iso)     0.03    0.56   0.75   0.90
MNDO/d (sphh)    0.04    0.41   0.63   0.92
PM3 (iso)        0.04    0.42   0.57   0.93
PM3 (sphh)       -0.22   0.53   0.70   0.90

The RMSD values vary from 0.57 to 0.84 logPow units. Validating the models by calculating the logPow for a data set of small compounds therefore provides a useful and informative assessment of their likely reliability. Very accurate logPow values are obtained with AM1, MNDO/d, and PM3. Comparing the RMSD obtained with the iso69 and the sphh70 shows that, for PM3, using the sphh70 reduces the predictive power, with an increase in the RMSD of 0.13 logPow units (from 0.57 to 0.70). In contrast, with MNDO/d the RMSD decreases (from 0.75 to 0.63) compared to the iso69. For AM1 and AM1*, there is no significant change in the predictive power for the logPow of small molecules when using the iso69 or the sphh70.
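The validation statistics reported in Table 2.5 can be computed from paired experimental and calculated values as in the following Python sketch. Two conventions are assumed here because the text does not state them: the signed error is taken as calculated minus experimental, and R2 is taken as the squared Pearson correlation coefficient. The four data points are invented for the example.

```python
import math

def validation_stats(experimental, calculated):
    """Return MSE (mean signed error), MUE, RMSD and R2 for paired values."""
    n = len(experimental)
    errors = [c - e for e, c in zip(experimental, calculated)]
    mse = sum(errors) / n
    mue = sum(abs(d) for d in errors) / n
    rmsd = math.sqrt(sum(d * d for d in errors) / n)

    mean_e = sum(experimental) / n
    mean_c = sum(calculated) / n
    cov = sum((e - mean_e) * (c - mean_c) for e, c in zip(experimental, calculated))
    var_e = sum((e - mean_e) ** 2 for e in experimental)
    var_c = sum((c - mean_c) ** 2 for c in calculated)
    r2 = cov ** 2 / (var_e * var_c)
    return {"MSE": mse, "MUE": mue, "RMSD": rmsd, "R2": r2}

# Hypothetical logPow values for four compounds
stats = validation_stats([1.0, 2.0, 3.0, 4.0], [1.5, 1.5, 3.5, 3.5])
print(stats)
```

An MSE close to zero together with a sizeable MUE, as in this toy example, indicates errors that scatter in both directions rather than a systematic offset.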

2.3.2.1.2 LogPow for Large Molecules

The large-molecule data set was obtained from Exploring QSAR84. It consists of 1842 molecules without zwitterions. As with the small molecules, the solvation models were used to calculate the logPow for the large molecules. However, the calculated values turned out to be very far from the experimental ones. To identify the source of the problem, 80% of the logPow data84 for the large molecules were randomly selected and used as a training set to build SIMs, and the remaining 20%, listed in Table A7 of the Appendix, were used as a test set. The models obtained from these multiple linear regression analyses predict much better than the solvation models.
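A random 80/20 partition of this kind can be sketched with the Python standard library. The seed, the truncation of the test-set size to an integer, and the function name are assumptions; the actual selection procedure used for the data set is not described in the text.

```python
import random

def split_train_test(items, test_fraction=0.2, seed=0):
    """Shuffle a copy of `items` and split it into (training set, test set)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Compound indices stand in for the 1842 molecules of the data set
train, test = split_train_test(range(1842))
print(len(train), len(test))
```

With 1842 molecules this gives 1474 training and 368 test compounds, which matches the test-set size N = 368 reported for Figure 2.10.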

Figure 2.10 shows the performance of the logPow model obtained from the SIM descriptors using the AM1 Hamiltonian and the iso69.

Figure 2.10. Good prediction of the logPow values for a test set by multiple linear regression analysis. N = 368, MSE = -0.06, MUE = 0.48, RMSD = 0.62, R2 = 0.84, R2cv = 0.77.

The main observation from Figure 2.10 is that no point deviates strongly from the line y = ax (a = 1), which suggests that linear regression can be fairly accurate for every type of compound.


The experimental and calculated values are listed in Table A7 of the Appendix.

The statistics for the logPow for the data set of large compounds are listed in Table 2.6 below.

Table 2.6. Statistical performances of the logPow models generated

                 Test set              Training set
Model            MUE    RMSD   R2      R2cv
AM1 (iso)        0.48   0.62   0.84    0.77
AM1 (sphh)       0.54   0.71   0.79    0.72
AM1* (iso)       0.65   0.81   0.73    0.12
AM1* (sphh)      0.75   0.98   0.59    0.46
MNDO/d (iso)     0.54   0.70   0.79    0.73
MNDO/d (sphh)    0.56   0.73   0.78    0.70
PM3 (iso)        0.51   0.65   0.82    0.75
PM3 (sphh)       0.54   0.69   0.80    0.73

Table 2.6 shows that the multiple linear regression approach gives more promising results. All the RMSD values are less than one logPow unit, although the RMSD for AM1* with the sphh70 is close to one logPow unit. For AM1*, using the sphh70 reduces the predictive power, increasing the MUE by ≈ 0.10 logPow units and the RMSD by ≈ 0.20 logPow units compared to the iso69. For AM1, MNDO/d, and PM3, there is no significant change in the predictive power for the logPow of large molecules when using the iso69 or the sphh70. For AM1*, seven iodine-containing compounds were also removed because AM1* reproduces iodine poorly.

2.3.2.2 The Chloroform-Water Partition Coefficient: logPcw

Chloroform, a substituted derivative of methane, is an organic solvent with distinctive properties. For this reason, the chloroform-water partition coefficient (logPcw) has a useful application85 in the prediction of the lipophilicity and biological activity of organic ligands. The data set used for calculating the logPcw was obtained from the literature85,86. It consists of 30 compounds, of which the 23 that were included in the data used to build the models were removed. The remaining seven compounds, listed in Table A8 of the Appendix, were then used for the validation process.

In Figure 2.11 the theoretical logPcw obtained using the iso69 and the AM1 Hamiltonian is presented.


Figure 2.11. Plot of theoretical logPcw values obtained with the AM1 Hamiltonian and the iso. N = 7, MSE = 0.30, MUE = 0.46, RMSD = 0.54, R2 = 0.95.

Here, there is an excellent agreement between the experimental and the calculated logPcw values.

The experimental and calculated values are listed in Table A8 of the Appendix.

Table 2.7 provides the statistical significance of logPcw for the seven compounds.

Table 2.7. Performances with a logPcw validation (validation set)

Model            MSE    MUE    RMSD   R2
AM1 (iso)        0.30   0.46   0.54   0.95
AM1 (sphh)       0.15   0.39   0.59   0.91
AM1* (iso)       0.29   0.66   0.81   0.82
AM1* (sphh)      0.93   0.93   1.02   0.94
MNDO/d (iso)     0.69   0.95   1.10   0.77
MNDO/d (sphh)    0.65   0.80   0.88   0.90
PM3 (iso)        0.12   0.73   0.90   0.84
PM3 (sphh)       0.45   0.49   0.58   0.98

The statistics listed in Table 2.7 indicate that the models obtained are reliable enough to predict ΔGsolv(CHCl3) accurately. For AM1*, using the sphh70 reduces the predictive power, increasing the RMSD by ≈ 0.20 logPcw units compared to the iso69. For MNDO/d and PM3, using the sphh70 considerably increases the predictive power, but there is no substantial change for AM1. One particularity here is that high R2 values (0.98 for PM3 and 0.94 for AM1*) are obtained with the sphh70, although AM1 with the iso69 also reaches R2 = 0.95.


2.4 Discussion

As mentioned in the introduction, the main objective of this research project was to develop accurate models for predicting ΔGsolv of compounds in different solvents. In the course of this investigation, solvent effects on the calculated ΔGsolv of organic compounds were studied. ΔGsolv is made up of two parts: the electrostatic and the nonpolar contributions. The electrostatic contribution, which depends on the charge method, was evaluated using SCRF calculations. Solvent effects were found to be more significant for ionic compounds: the ΔGsolv values of ions are much larger than those of neutral solutes and are dominated by large electrostatic contributions. The nonpolar part, which depends on the surface method, was evaluated using the multiple linear regression technique.

Calculating the theoretical logPow using the standard formula gives a very good correlation between the experimental and the calculated values for small molecules, as shown in Table 2.5. Here the mean unsigned errors and the RMSDs are uniformly small. The main exception is AM1*, which gave an MUE and RMSD of 0.61 and 0.79 for the iso69 and of 0.59 and 0.84 for the sphh70. However, this approach has problems when applied to large molecules, as was also mentioned in the work of P. Kollman et al.87 A likely reason is that many of the interior atoms of large molecules are completely buried, so the model cannot provide very accurate ΔGsolv values. In addition, the solvation models presented here were built with small molecules. Using large molecules (whose experimental ΔGsolv values are not easy to obtain) to generate the solvation models might improve the accuracy of the theoretical logP for large molecules.

The major advantage of these solvation models is that they can be used to calculate the logPow or the logPcw accurately for a data set of small molecules. They are therefore applicable to small- and medium-sized compounds in which few atoms are totally buried. Their disadvantage is that their scope is limited to such compounds: they cannot be applied to a data set of large molecules. Consequently, before the solvation free energies obtained from these models are used as parameters in QSAR studies, one must first properly assess the quality of the data to be used.

To avoid the difficulties mentioned above, it would be better to calculate the logPow for large molecules by linear regression. This gives, in general, a good correlation between the experimental and the calculated values. Table 2.8 compares the theoretical and the simulated logPow obtained with AM1 and AM1* for large molecules.


Table 2.8. Comparison of the performances of the logPow obtained for large molecules (compounds of the test set)

Model          logPow(calc)                    logPow(sim)
               MSE     MUE    RMSD   R2        MSE     MUE    RMSD   R2
AM1 (iso)      -0.36   0.80   1.15   0.53      -0.06   0.48   0.62   0.84
AM1 (sphh)     -0.017  1.47   3.37   0.25      -0.06   0.54   0.71   0.79
AM1* (iso)     -0.47   1.72   2.77   0.08      -0.02   0.65   0.81   0.73
AM1* (sphh)    -0.14   1.82   2.66   0.11      -0.09   0.75   0.98   0.59
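The four statistics reported in Table 2.8 are not defined in this section. The sketch below shows one common way of computing them, assuming that MSE denotes the mean signed error (calculated minus experimental), MUE the mean unsigned error, RMSD the root mean square deviation, and R2 the squared Pearson correlation coefficient. The function name and the example data are hypothetical.

```python
import math


def error_statistics(experimental, calculated):
    """Return MSE (mean signed error), MUE, RMSD and R2 for paired values."""
    n = len(experimental)
    errors = [c - e for e, c in zip(experimental, calculated)]
    mse = sum(errors) / n
    mue = sum(abs(d) for d in errors) / n
    rmsd = math.sqrt(sum(d * d for d in errors) / n)

    # Squared Pearson correlation between experiment and calculation.
    mean_e = sum(experimental) / n
    mean_c = sum(calculated) / n
    cov = sum((e - mean_e) * (c - mean_c)
              for e, c in zip(experimental, calculated))
    var_e = sum((e - mean_e) ** 2 for e in experimental)
    var_c = sum((c - mean_c) ** 2 for c in calculated)
    r2 = cov * cov / (var_e * var_c)

    return {"MSE": mse, "MUE": mue, "RMSD": rmsd, "R2": r2}


# Hypothetical logPow values for three compounds.
print(error_statistics([1.0, 2.0, 3.0], [1.5, 2.0, 2.5]))
```

A signed error close to zero together with a large RMSD, as in the AM1 (sphh) row of the logPow(calc) columns, indicates large positive and negative errors that cancel on average.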

The experimental, the theoretical and the predicted values of logPow obtained using the AM1 Hamiltonian, the iso69 and the sphh70 for the large molecules sulfanilamide and 3,5-diiodosalicylic acid are listed in Table 2.9.

Figure 2.12. Sulfanilamide.

Figure 2.13. 3,5-Diiodosalicylic acid.

Table 2.9. Comparison of logPow values obtained from different models for selected compounds of the test set

Compound                   Experiment   Model        Water-octanol models   SIMlogPow models
Sulfanilamide              -0.62        AM1 (iso)     3.77                   0.003
                                        AM1 (sphh)   -5.09                  -0.96
                                        AlogPs                              -0.16
3,5-Diiodosalicylic acid    4.56        AM1 (iso)     9.48                   3.32
                                        AM1 (sphh)   43.45                   3.33
                                        AlogPs                               3.13

For sulfanilamide, the deviations between the experimental and the theoretical logPow are 4.39 and 4.47 for the iso69 and the sphh70, respectively; for the logPow predicted using multiple linear regression, the corresponding deviations are 0.62 and 0.34.


For 3,5-diiodosalicylic acid, the deviations between the experimental and the theoretical logPow are 4.92 and 38.9 for the iso69 and the sphh70, respectively; for the logPow predicted using multiple linear regression, the corresponding deviations are 1.24 and 1.23.
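The deviations quoted for the two compounds can be reproduced directly from the entries of Table 2.9. The short sketch below does this arithmetic; the dictionary layout and the function name are illustrative choices, while the numbers are those of the table.

```python
# Values transcribed from Table 2.9: experimental logPow, logPow from the
# water-octanol solvation models ("calc"), and from the SIMlogPow models ("sim").
TABLE_2_9 = {
    "sulfanilamide": {
        "experiment": -0.62,
        "calc": {"AM1 (iso)": 3.77, "AM1 (sphh)": -5.09},
        "sim": {"AM1 (iso)": 0.003, "AM1 (sphh)": -0.96},
    },
    "3,5-diiodosalicylic acid": {
        "experiment": 4.56,
        "calc": {"AM1 (iso)": 9.48, "AM1 (sphh)": 43.45},
        "sim": {"AM1 (iso)": 3.32, "AM1 (sphh)": 3.33},
    },
}


def absolute_deviations(entry, source):
    """Absolute deviation from experiment for each model, to two decimals."""
    return {
        model: round(abs(value - entry["experiment"]), 2)
        for model, value in entry[source].items()
    }


for name, entry in TABLE_2_9.items():
    print(name,
          absolute_deviations(entry, "calc"),
          absolute_deviations(entry, "sim"))
```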

Compared with the logPow obtained from the SIMs, the logPow values of these compounds calculated directly from the solvation models are significantly worse, for the reason mentioned previously. There is no large difference between the logPow values obtained from the SIMlogPow models and those obtained from a publicly available model (AlogPs). The surface plays an important role in calculating the nonpolar contribution to ΔGsolv and in predicting the logPow. The logPow obtained by linear regression is thus surface-method dependent, as is the nonpolar part of ΔGsolv.

Three main parameters (the local property, the data quality, and the molecular surface) seem to govern the predictive ability of these solvation models. The absence of L for AM1* and MNDO/d affects the calculations performed with these Hamiltonians, which generally give higher RMSD values than AM1 and PM3. Because these Hamiltonians are designed for compounds containing elements with d-orbitals, their atomic parameters may also be a source of the large errors obtained. AM1* is an extension of the AM1 semi-empirical technique; heats of reaction were added in its parameterization, which can have a considerable impact on the heats of formation, so that these should be treated with some caution and weighted less heavily.

Although V consistently dominates, as shown in Figure 2.8, the five local properties together give sufficient information for a full description of a molecule and its intermolecular binding properties. V and L are the terms that always appear on their own in each of the solvation models for water, octanol and chloroform. In contrast, the other local properties (EAL, IEL and L) appear either on their own in a given model or in combination with other local properties. Within an SCRF approach, V and L seem to be the main factors responsible for the interaction between the dissolved molecule and the solvent. One peculiarity is observed for water: V and L each appear twice as single terms (equation 2.7), whereas in the models for octanol (equation 2.8) and chloroform (equation 2.9) they appear only once. This may be because water offers more specific kinds of interaction (Lewis-base, hydrogen-bond donor, hydrogen-bond acceptor, etc.) than the organic solvents.

The average correlation coefficients of the solvation models developed are 0.87, 0.85, and 0.85 for water (neutral molecules), octanol and chloroform, respectively. This shows that the models are sufficiently robust to be used for deriving further properties. Moreover, their correlation coefficients, presented in Tables 2.2, 2.3, and 2.4, have reasonable magnitudes and are similar for the same Hamiltonian and surface (i.e., the values do not differ greatly). This supports the reliability of these models, because each Hamiltonian is characterized by its own atomic parameters, which are the same for different Hamiltonians only in exceptional cases. The MUE values range from 0.67 to 1.12 kcal mol-1 for ΔGsolv(H2O), 0.56 to 0.89 kcal mol-1 for ΔGsolv(octanol), and 0.46 to 0.72 kcal mol-1 for ΔGsolv(CHCl3). Compared with some currently available models, such as the SMn solvation models of Truhlar et al.88,89, for which the MUE is 0.49 kcal mol-1, or the GB/PSA models of Friesner et al.90, which gave an MUE of 0.60 kcal mol-1, the present models are not as uniformly accurate. Since the models are sufficiently reliable for predicting solvation free energies in a specific theoretical framework, the fact that they also depend on the gas-phase electron density can allow for an indirect inclusion of solute polarization via 68. These models still face another major problem, the lack of physical interpretation, which is and remains a characteristic feature of QSAR and QSPR models derived from surface integrals. Fundamentally, finding an adequate physical interpretation for activity or property models is currently a priority in the modeling of desirable properties68.

2.5 Conclusions

Combining the pure Coulomb ΔGsolv from SCRF calculations with a local term calculated as the surface integral of a function of local properties leads to a robust model for the solvation free energies that can be validated by checking its performance in predicting logPow or logPcw. The error estimates for individual compounds, obtained by calculating their logPow in different ways, helped to identify clearly the type of compounds for which the models are less reliable. The results vary with the Hamiltonian and the surface used, but are quite good according to the correlation coefficients. The quality of the surface used for calculating the molecular descriptors plays an important role in predicting the target property. In general, the iso69 gives better results than the sphh70. V and L play special roles in the solvation models on their own, but also significant roles in combination with one or more of the other local properties. From the results obtained, ΔGsolv can be considered a local property for certain kinds of organic compounds, and the local ΔGsolv(H2O) seems to be the main parameter for the study of intermolecular interaction sites. Good agreement with experimental results has been obtained, especially for the AM1 Hamiltonian, which gives the most accurate results. More importantly, although AM1 has no atomic parameters for d-orbitals, the lowest RMSD and the highest R2 are in general obtained with AM191, which is the standard NDDO92-based semi-empirical MO theory, and the iso69. The question remains: Is AM1 the most appropriate Hamiltonian for such QSPR studies?


2.6 References

1. Hehre, W.; Radom, L.; Schleyer, P. v. R.; Pople, J. A. Ab Initio Molecular Orbital Theory; Wiley: New York, 1986.

2. Clark, T. Modelling the chemistry: Time to Break the Mould? EuroQSAR 2002: Designing drugs and crop protectants; Ford, M., Dearden, J., Eds.; 2003; pp 111-121.

3. Hansch, C.; Maloney, P. P.; Fujita, T.; Muir, R. M. Nature 1962, 194, 178.

4. Hansch, C.; Fujita, T. J. Am. Chem. Soc. 1964, 86, 1616.

5. Ivanciuc, T.; Ivanciuc, O.; Klein, D. J. Posetic quantitative superstructure/activity relationships (QSSARs) for chlorobenzenes. J. Chem. Inf. Model. 2005, 45, 870-879.

6. Ivanciuc, T.; Ivanciuc, O.; Klein, D. J. Modeling the bioconcentration factors and bioaccumulation factors of polychlorinated biphenyls with posetic quantitative super-structure/activity relationships (QSSAR). Mol. Divers. 2006, 10, 133-145.

7. Ivanciuc, T.; Ivanciuc, O.; Klein, D. J. Prediction of environmental properties for chlorophenols with posetic quantitative super-structure/property relationships (QSSPR). Int. J. Mol. Sci. 2006, 7, 358-374.

8. Gonzalez-Diaz, H.; Gonzalez-Diaz, Y.; Santana, L.; Ubeira, F. M.; Uriarte, E. Proteomics, networks and connectivity indices. Proteomics 2008, 8, 750-778.

9. Leo, A.; Hansch, C.; Elkins, D. Partition Coefficients And Their Uses. Chem. Rev. 1971, 71, 524-616.

10. Hansch, C.; Leo, A. Exploring QSAR: Fundamentals and Applications in Chemistry and Biology; American Chemical Society: Washington, DC, 1995.

11. Leo, A. J. ClogP; Daylight Chemical Information Systems: Irvine, CA, 1991.

12. Viswanadhan, V. N.; Reddy, M. R.; Bacquet, R. J.; Erion, D. M. Assessment of Methods Used for Predicting Lipophilicity: Application to Nucleosides and Nucleoside Bases. J. Comput. Chem. 1993, 9, 1019-1026.

13. Klopman, G.; Li, J.-Y.; Wang, S.; Dimayuga, M. Computer Automated logP Calculations Based on an Extended Group Contribution Approach. J. Chem. Inf. Comput. Sci. 1994, 34, 752-781.

14. Benson, S. W. Thermochemical Kinetics, 2nd ed.; Wiley: New York, 1976.

15. Clark, T.; McKervey, M. A. Saturated Hydrocarbons. In Comprehensive Organic Chemistry; Barton, D. H. R.; Ollis, W. D., Eds.; Pergamon Press: Oxford, 1979; Vol. 1, Chapter 2, pp 37-120.

16. Cohen, N.; Benson, S. W. In Chemistry of alkanes and cycloalkanes; Patai, S.; Rappoport, Z., Eds.; Wiley: Chichester, 1992; Chapter 6, p 215.

17. Stein, S. E.; Brown, R. L. Estimation of normal boiling points from group contributions. J. Chem. Inf. Comput. Sci. 1994, 34, 581-587.


18. Constantinou, L.; Gani, R. New group contribution method for estimating properties of pure compounds. AIChE J. 1994, 40, 237-244.

19. Klopman, G.; Wang, S.; Balthasar, D. M. Estimation of aqueous solubility of organic molecules by the group contribution approach. Application to the study of biodegradation. J. Chem. Inf. Comput. Sci. 1992, 32, 474-482.

20. Kuhne, R.; Ebert, R.-U.; Kleint, F.; Schmidt, G.; Schuurmann, G. Group contribution methods to estimate water solubility of organic chemicals. Chemosphere 1995, 30, 2061-2077.

21. Horn, A. H. C.; Lin, Jr-H.; Clark, T. Theor. Chem. Acc. 2005, 114, 159-168.

22. Rauhut, G.; Clark, T.; Steinke, T. J. Am. Chem. Soc. 1993, 115, 9174-9181.

23. Tomasi, J.; Persico, M. Chem. Rev. 1994, 94, 2027.

24. Cramer, C. J.; Truhlar, D. G. Chem. Rev. 1999, 99, 2161.

25. (a) Huron, M.-J.; Claverie, P. J. Phys. Chem. 1972, 76, 2123. (b) Huron, M.-J.; Claverie, P. J. Phys. Chem. 1974, 78, 1853, 1862.

26. (a) Gomez-Jeria, J. S.; Contreras, R. R. Int. J. Quantum Chem. 1986, 15, 591. (b) Gomez-Jeria, J. S.; Morales-Lagos, D. J. Phys. Chem. 1990, 94, 3790. (c) Morales-Lagos, D.; Gomez-Jeria, J. S. J. Phys. Chem. 1991, 95, 5308.

27. (a) Fox, T.; Rösch, N.; Zauhar, R. J. J. Comput. Chem. 1993, 14, 253. (b) Fox, T.; Rösch, N. Chem. Phys. Lett. 1992, 191, 33. (c) Zauhar, R. J.; Morgan, R. S. J. Comput. Chem. 1988, 9, 171.

28. (a) Klopman, G. Chem. Phys. Lett. 1967, 1, 200. (b) Klopman, G.; Andreozzi, P. Theor. Chim. Acta 1980, 55, 77.

29. (a) Germer, H. A. Theor. Chim. Acta 1974, 34, 145. (b) Germer, H. A. Theor. Chim. Acta 1974, 35, 273.

30. (a) Miertus, S.; Kysel, O. Chem. Phys. 1977, 21, 27, 33, 47. (b) Duben, A. J.; Miertus, S. Theor. Chim. Acta 1981, 60, 327.

31. Born, M. Z. Phys. 1920, 1, 45.

32. Mulliken, R. S. J. Chem. Phys. 1955, 23, 1833.

33. (a) Cramer, C. J.; Truhlar, D. G. J. Am. Chem. Soc. 1991, 113, 8305. (b) Cramer, C. J.; Truhlar, D. G. J. Am. Chem. Soc. 1991, 113, 8552. (c) Cramer, C. J.; Truhlar, D. G. Science 1992, 256, 213. (d) Cramer, C. J.; Truhlar, D. G. J. Comput. Chem. 1992, 13, 1089.

34. (a) Miertus, S.; Scrocco, E.; Tomasi, J. Chem. Phys. 1981, 55, 117. (b) Miertus, S.; Tomasi, J. Chem. Phys. 1982, 65, 239. (c) Bonaccorsi, R.; Cimiraglia, R.; Tomasi, J. J. Comput. Chem. 1983, 4, 567. (d) Tomasi, J.; Alagona, G.; Bonaccorsi, R.; Ghio, G. In Modelling of Structure and Properties of Molecules; Maksic, Z. B., Ed.; Wiley: New York, 1987; p 330. (e) Floris, F.; Tomasi, J. J. Comput. Chem. 1989, 10, 616. (f) Bonaccorsi, R.; Cammi, R.; Tomasi, J. J. Comput. Chem. 1991, 12, 301. (g) Tomasi, J.; Bonaccorsi, R.; Cammi, R.; Olivares del Valle, F. J. J. Mol. Struct. (THEOCHEM) 1991, 234, 401. (h) Tunon, I.; Silla, E.; Tomasi, J. J. Phys. Chem. 1992, 96, 9043.


35. (a) Aguilar, M. A.; Olivares del Valle, F. J. Chem. Phys. 1989, 129, 439. (b) Aguilar, M. A.; Olivares del Valle, F. J. Chem. Phys. 1989, 138, 327. (c) Olivares del Valle, F. J.; Tomasi, J. Chem. Phys. 1991, 150, 139. (d) Aguilar, M. A.; Olivares del Valle, F. J.; Tomasi, J. Chem. Phys. 1991, 150, 151. (e) Tolosa, S.; Esperilla, J. J.; Olivares del Valle, F. J. J. Comput. Chem. 1990, 11, 576. (f) Cammi, R.; Olivares del Valle, F. J.; Tomasi, J. Chem. Phys. 1988, 122, 63. (g) Olivares del Valle, F. J.; Aguilar, M. A. J. Comput. Chem. 1992, 13, 115.

36. Hawkins, G. D.; Cramer, C. J.; Truhlar, D. G. J. Phys. Chem. 1996, 100, 19824-19839.

37. Cramer, C. J.; Truhlar, D. G. J. Comput.-Aided Mol. Des. 1992, 6, 629-666.

38. Hawkins, G. D.; Cramer, C. J.; Truhlar, D. G. Chem. Phys. Lett. 1995, 246, 122-129.

39. Jayaram, B.; Sprous, D.; Beveridge, D. L. J. Phys. Chem. B 1998, 102, 9571-9576.

40. Still, W. C.; Tempczyk, A.; Hawley, R. C.; Hendrickson, T. J. Am. Chem. Soc. 1990, 112, 6127-6129.

41. Beveridge, D. L.; Dicapua, F. M. Annu. Rev. Biophys. Chem. 1989, 18, 431-492.

42. Sitkoff, D.; Sharp, K. A.; Honig, B. J. Phys. Chem. 1994, 98, 1978-1988.

43. Luo, R.; Moult, J.; Gilson, K. J. Phys. Chem. B 1997, 101, 11226-11236.

44. Dolney, D. M.; Hawkins, G. D.; Winget, P.; Liotard, D.; Cramer, C. J.; Truhlar, D. G. J. Comput. Chem. 2000, 340-366.

45. Klamt, A.; Schüürmann, G. J. Chem. Soc., Perkin Trans. 1993, 2, 799-805.

46. Barone, V.; Cossi, M.; Tomasi, J. J. Comput. Chem. 1998, 19, 404-417.

47. (a) Pascual-Ahuir, J. L.; Silla, E.; Tomasi, J.; Bonaccorsi, R. J. Comput. Chem. 1987, 8, 778. (b) Pascual-Ahuir, J. L.; Silla, E. J. Comput. Chem. 1990, 11, 1047. (c) Silla, E.; Tunon, I.; Pascual-Ahuir, J. L. J. Comput. Chem. 1991, 12, 1077.

48. (a) Connolly, M. L. J. Appl. Crystallogr. 1983, 16, 548. (b) Connolly, M. L. Molecular Surface Program, QCPE No. 429.

49. Marsili, M. In Physical Property Prediction in Organic Chemistry; Jochum, C.; Hicks, M. G.; Sunkel, J., Eds.; Springer: Berlin, 1988; p 249.

50. (a) Petrongolo, C.; Tomasi, J. Int. J. Quantum Chem. Symp. 1975, 2, 181. (b) Loew, G. H.; Berkowitz, D. S. J. Med. Chem. 1975, 18, 656. (c) Mazurek, A. D.; Weinstein, H.; Osman, R.; Topiol, S.; Ebersole, B. J. Quantum Biol. Symp. 1984, 11, 183.

51. (a) Pullman, B. Intermolecular Interactions: From Diatomics to Biopolymers; Wiley: Chichester, UK, 1978. (b) Kaplan, I. P. Theory of Molecular Interactions; Elsevier: Amsterdam, 1968. (c) Besler, B. H.; Merz, K. M.; Kollman, P. A. J. Comput. Chem. 1990, 11, 431.

52. (a) Carbo, R.; Leyda, L.; Arnau, M. Int. J. Quantum Chem. 1980, 17, 1185. (b) Burt, C.; Huxley, P.; Richards, W. G. J. Comput. Chem. 1990, 11, 117.


53. Politzer, P.; Murray, J. S. Molecular electrostatic potentials and chemical reactivity. In Rev. Comput. Chem.; Lipkowitz, K.; Boyd, R. B., Eds.; VCH: New York, 1998; Vol. 2, pp 273-312.

54. Murray, J. S.; Politzer, P. The use of the molecular electrostatic potential in medicinal chemistry. In Quantum medicinal chemistry; Carloni, P.; Alber, F.; Mannhold, R.; Kubinyi, H.; Folkers, G., Eds.; Wiley-VCH: New York, 2003; Vol. 17, pp 233-254.

55. Ehresmann, B.; de Groot, M. J.; Alex, A.; Clark, T. J. Chem. Inf. Comput. Sci. 2004, 43, 658-668.

56. Podlipnik, C.; Koller, J. Croat. Chem. Acta 1998, 71, 689-696.

57. Reynolds, C. A.; Richards, W. G.; Goodford, P. J. J. Chem. Soc., Perkin Trans. II 1988, 551-556.

58. Miertus, S.; Scrocco, E.; Tomasi, J. J. Chem. Soc., Perkin Trans. 1981, 1439-1443.

59. Wong, M. W.; Frisch, M. J.; Wiberg, K. B. J. Am. Chem. Soc. 1991, 113, 4776-4782.

60. Zou, J.; Yu, Y.; Shang, Z. J. Chem. Soc., Perkin Trans. 2001, 1439-1443.

61. Rauhut, G.; Clark, T. J. Comput. Chem. 1993, 14, 503-509.

62. Beck, B.; Rauhut, G.; Clark, T. J. Comput. Chem. 1994, 15, 1064-1073.

63. Göller, A. H.; Horn, A. H. C.; Clark, T. (unpublished).

64. Horn, A. H. C. Ph.D. Thesis, Friedrich-Alexander-Universität Erlangen-Nürnberg, 1994.

65. Hine, J.; Mookerjee, P. K. J. Org. Chem. 1975, 40, 292-298.

66. Eisenberg, D.; McLachlan, A. D. Nature 1986, 319, 199-203.

67. Hawkins, G. D.; Cramer, C. J.; Truhlar, D. G. J. Phys. Chem. B 1997, 101, 7147-7157.

68. Bernd, E.; de Groot, M. J.; Clark, T. J. Chem. Inf. Model. 2005, 45, 1053-1060.

69. Jr-Hung, L.; Clark, T. An Analytical, Variable Resolution, Complete Description of Static Molecules and Their Intermolecular Binding Properties. J. Chem. Inf. Model. 2005, 45, 1010-1016.

70. Cai, W.; Shao, X.; Maigret, B. Protein-ligand recognition using spherical harmonic molecular surfaces: towards a fast and efficient filter for large virtual throughput screening. J. Mol. Graphics Modell. 2002, 20, 313-328.

71. Max, N. L.; Getzoff, E. D. Spherical Harmonic Molecular Surfaces. IEEE Comput. Graphics Appl. 1988, 8, 42-50.

72. Duncan, B. S.; Olson, A. J. Approximation and Characterization of Molecular Surfaces; Scripps Institute: San Diego, California, 1995.


73. Clark, T.; Alex, A.; Beck, B.; Burkhardt, F.; Chandrasekhar, J.; Gedeck, P.; Horn, A. H. C.; Hutter, M.; Martin, B.; Rauhut, G.; Sauer, W.; Schindler, T.; Steinke, T. VAMP, 9.0; Accelrys Inc.: San Diego, 2003.

74. Clark, T.; Lin, J.-H.; Horn, A. H. C. ParaSurf 1.0; Computer-Chemie-Centrum, University of Erlangen: Erlangen, 2004.

75. CORINA 3D Structure Generator; Molecular Networks GmbH: Erlangen, Germany, 2006.

76. Sadowski, J.; Gasteiger, J.; Klebe, G. Comparison of Automatic Three-Dimensional Model Builders Using 639 X-Ray Structures. J. Chem. Inf. Comput. Sci. 1994, 34, 1000-1008.

77. Heiden, W.; Goetze, T.; Brickmann, J. Fast generation of molecular surfaces from 3D data fields with an enhanced "marching cube" algorithm. J. Comput. Chem. 1993, 14, 246-250.

78. Cai, W.; Zhang, M.; Maigret, B. New approach for representation of molecular surface. J. Comput. Chem. 1998, 19, 1805-1815.

79. Tsar 3.3; Oxford Molecular Ltd.: Oxford, 2000.

80. Jiabo, Li.; Tianhai, Zhu.; Hawkins, G. D.; Winget, P.; Daniel, A. L.; Cramer, J. C.; Donald, G. T. Extension of the platform of applicability of the SM5.42R universal solvation model. Theor. Chem. Acc. 1999, 103, 9-63.

81. Lyman, W. J.; Reehl, W. F.; Rosenblatt, D. H. Handbook of Chemical Property Estimation Methods; American Chemical Society: Washington, DC, 1990.

82. Hansch, C.; Leo, A.; Mekapati, S. B.; Kurup, A. QSAR and ADME. Bioorg. Med. Chem. 2004, 12, 3391-3400.

83. Soskic, M. J. Chem. Inf. Model. 2005, 45, 930-938.

84. Hansch, C.; Leo, A.; Hoekman, D. Exploring QSAR: Hydrophobic, Electronic, and Steric Constants; The American Chemical Society: Washington, DC, 1995.

85. Reynolds, C. H. J. Chem. Inf. Comput. Sci. 1995, 35, 738.

86. Giesen, D. J.; Chambers, C. C.; Cramer, J. C.; Truhlar, D. G. J. Phys. Chem. B 1997, 101, 2061-2069.

87. Junmei, W.; Wie, W.; Shuanghong, H.; Matthew, L.; Kollman, P. A. J. Phys. Chem. B 2001, 105, 5055-5067.

88. Thompson, J. D.; Cramer, C. J.; Truhlar, D. G. New Universal Solvation Model and Comparison of the Accuracy of the SM5.42R, SM5.43R, C-PCM, and IEF-PCM Continuum Solvation Models for Aqueous and Organic Solvation Free Energies and for Vapor Pressures. J. Phys. Chem. A 2004, 108, 6532-6542.

89. Giesen, D. J.; Chambers, C. C.; Cramer, C. J.; Truhlar, D. G. Solvation Model for Chloroform Based on Class IV Atomic Charges. J. Phys. Chem. B 1997, 101, 2061-2069.


90. Tannor, D. J.; Marten, B.; Murphy, R.; Friesner, R. A.; Sitkoff, D.; Nicholls, A.; Honig, B.; Ringnalda, M.; Goddard, W. A. Accurate First Principles Calculation of Molecular Charge Distributions and Solvation Energies from Ab Initio Quantum Mechanics and Continuum Dielectric Theory. J. Am. Chem. Soc. 1994, 116, 11875-82.

91. Dewar, M. J. S.; Zoebisch, E. G.; Healy, E. F.; Stewart, J. J. P. Development and use of quantum mechanical molecular models. 76. AM1: a new general purpose quantum mechanical molecular model. J. Am. Chem. Soc. 1985, 107, 3902-3909. Holder, A. J. AM1. In Encyclopedia of Computational Chemistry; Schleyer, P. v. R., Allinger, N. L., Clark, T., Gasteiger, J., Kollman, P. A., Schaefer, H. F., III, Schreiner, P. R., Eds.; Wiley: Chichester, 1998; pp 8-11.

92. Pople, J. A.; Santry, D. P.; Segal, G. A. J. Chem. Phys. 1965, 43, S129-S135.


Chapter 3

Binned Surface-Integral Models for Predicting the Octanol-Water Partition Coefficient


3.1 Introduction

In science, and more specifically in chemistry, certain assumptions or concepts can help either to compare results or to plan future experiments1. Hydrophobicity, which can be defined as the tendency of an organic molecule to dissolve in non-aqueous substances such as oils and fats, is one such concept. Hydrophobic molecules are generally nonpolar, and in an aqueous medium they cluster together; the interaction between a nonpolar molecule and the aqueous solvent is described by both hydrophobic hydration and hydrophobic interaction2-5. Hydrophobic effects play a fundamental role in many reaction mechanisms that take place in aqueous medium. Through phenomena such as hydrophobic interactions between the receptor and ligand during transport through a membrane, or other factors related to the pharmacokinetic properties of the molecule6,7, lipophilicity exerts considerable influence on the biological activity of ligands, the transport of bioactive molecules through biological membranes8, and the environmental fate of organic molecules9. Hydrophobicity is commonly used to describe the free energy change that occurs when a drug moves from an aqueous phase into a lipid bilayer10. It can also play a key role in the binding of a substrate to its macromolecular active site11. In molecular modeling, hydrophobicity is an important parameter for the design and development of QSAR models12. Quantitatively, lipophilicity is expressed as the logarithm of the partition coefficient, logPow, of a compound in the n-octanol/water system11.

Because lipophilicity plays a key role in ligand transport, ligand binding, and the development of the QSAR models needed to predict biological activity13-21, the development of standard methods that calculate the logPow effectively is of considerable scientific and financial interest. Many methods based on different approaches have therefore been developed. In particular, atomic or molecular fragment-based group additivity, combined with linear regression on a database of compounds with known partition constants, has produced the successful ClogP18 and AlogP14 models13-17. In the latter case, specific values are determined for structural equivalents.

The surface-integral method, commonly called surface-integral models22 (SIMs), is a technique for obtaining the target property by integrating a function of the local properties over the entire surface of the molecule. It belongs to the group of QSPR models and can indicate which characteristics contribute to the modeled physical property. In the previous chapter, SIMs were used to obtain the local surface tension contribution to the solvation free energy. Similarly, in previous work, solvation energies have been determined explicitly through surface integrals treated as equivalent to volume integrals23. Local properties24, usually defined using semi-empirical molecular orbital (MO) theory, are generally used for generating solvation models in different solvents25. These models, far from being a simple integral of a local solvation function, always contain a large constant, and hence the solvation energies cannot be regarded as local properties. This is clearly the case for models derived from least-squares fitting, where the training data are sufficiently localized that the model gives only relatively small deviations from the constant. To generate a SIM for logPow efficiently, one must first define the local hydrophobicity, whose integral is logPow. The main objective of the present study is to establish a new approach for generating SIMs for logPow that relies on the true local hydrophobicity. This means clearly defining a true local hydrophobicity for which no constant needs to be added to the surface integral, and for which the local properties used are obtained from semi-empirical MO calculations.
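The idea of a SIM without an additive constant can be illustrated numerically. The sketch below assumes a triangulated molecular surface on which each triangle carries an area and a set of local property values, and approximates the surface integral as an area-weighted sum. The linear form of the local function, its coefficients and the example surface are hypothetical and are not the model developed in this work.

```python
def surface_integral(triangle_areas, local_values, local_function):
    """Approximate the surface integral of a local function as the sum of
    (triangle area) x (function value on that triangle).
    No constant term is added to the result."""
    return sum(
        area * local_function(props)
        for area, props in zip(triangle_areas, local_values)
    )


def local_hydrophobicity(props):
    """Hypothetical local hydrophobicity: a linear function of two local
    properties (V and IEL) with made-up coefficients."""
    return 0.5 * props["V"] + 0.1 * props["IEL"]


# Hypothetical surface with three triangles.
areas = [2.0, 1.0, 0.5]
values = [
    {"V": 1.0, "IEL": 10.0},
    {"V": -2.0, "IEL": 10.0},
    {"V": 0.0, "IEL": 20.0},
]

print(surface_integral(areas, values, local_hydrophobicity))
```

Because the target property is the plain sum of local contributions, each surface element can be assigned its own share of the result, which is what makes a local interpretation possible.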

The protonation state naturally affects the logPow, which is essentially determined by the difference between the free energies of solvation of a compound in water and in water-saturated n-octanol26. In most cases, it is the logPow of the neutral compound that is used. In other words, the logPow is measured with the compound in a buffered solution in which it is un-ionized or zwitterionic. Chloroform and n-hexane27 are other solvents that can be used to define the lipophilic environment of a molecule. n-Hexane, which is a fully nonpolar solvent, tends to be more hydrophobic than n-octanol28, which in contrast is a hydrogen-bond donor and acceptor. However, because more data are available for logPow than for the partition coefficients between water and other organic phases, n-octanol has been adopted as the reference solvent for lipophilicity. The logPow is widely used in predicting transmembrane transport properties, protein binding, receptor affinity, and pharmacological activity of molecules20. The reliability of predicted logPow values can have a considerable effect on the design process because there is a strong relationship between this process and the predicted structures20. Some research groups, rather than relying on the most common approach of using structural fragments to calculate logPow, have turned to molecular properties for generating logPow models17,19,20,21. Following this line of research, Klopman established a method that relies on partial charges calculated with MINDO/329. Bodor and co-workers, with their original BlogP model, extended the use of molecular properties for calculating the logPow20. Determining logPow values experimentally is generally not a complicated process. The logPow can be measured in the laboratory using well-developed techniques30, such as correlation with chromatographic retention times31 and automated titration with potentiometric measurements32, which seem to be used more often than the original shake-flask method. These techniques yield fairly reliable logPow values, which can be used with confidence to generate logPow models. Several logPow databases have been made available to the scientific community, among them the PHYSPROP33, Beilstein34, and LOGKOW35 databases. These databases have been used with considerable success because, in conjunction with a variety of descriptors and techniques such as multiple linear regression36, support vector machines37, partial least-squares38, neural networks, and ensembles thereof39, they allow logPow models to be generated.

This work presents an extension of the new binned SIM approach developed by Clark et al.40 The AM1 Hamiltonian and the isodensity surface (iso)42 used originally are supplemented by other Hamiltonians and the solvent-excluded surface (SES)41, and a further local property, the electronegativity, is introduced.

3.2 Methods of Calculating

The LOGKOW database35 provided 37,783 logPow values for 23,479 compounds collected from the literature. For these compounds, the values suggested by LOGKOW were checked, and an average value was taken for those with shortcomings. Comparing the structures of these compounds with those stored in SciFinder or PubChem yielded 11,500 compounds whose structures are virtually identical to those in these databases and which therefore meet the standards. From these, all duplicate entries, compounds with permanent charges or unpaired electrons, compounds containing atoms for which no parameters are available in VAMP43, and all zwitterionic compounds were removed, leaving 10,813 compounds (9241 with rotatable bonds and 1572 rigid) for this work. The SMILES string of each compound was converted into a 3D structure with CORINA44. The Molecular Operating Environment (MOE)45 was used to calculate the number of rotatable bonds and the number of hydrogen-bond donor and acceptor atoms. A set of 368 compounds obtained randomly from 20% of the data46, used as a test set in chapter 2, served as an external validation set.

The geometries of the 3D molecular structures were optimized in the gas phase using the AM1, AM1*, MNDO, MNDO/d, PM3, and PM6 Hamiltonians, which are fully implemented in VAMP43. Starting from the optimized geometries, the molecular descriptors were calculated with ParaSurf1047 after first selecting the type of molecular surface, which can be the default iso42 or the SES41. The local properties obtained from the ParaSurf calculation are the molecular electrostatic potential (MEP), local ionization energy (IEL), local electron affinity (EAL), local hardness (HARD), local polarizability (POL), local electronegativity (ENEG) and the field normal to the surface (FN)48.

Bagging is a commonly used modern technique whose major asset is that it mitigates effects such as noise and outliers in the training set. It also provides, during model generation, a reasonable estimate of test-set performance, with a test set that can be as large as the training set49. The bagging version of stepwise multiple linear regression50 was therefore used to generate the models. Ninety-five percent of the critical F-value51 was calculated first and then used as the stopping criterion for forward and backward stepping; this avoids overtraining while giving some assurance that the variables included in the models are significant. Judging from the results obtained, this strategy generally leads to sufficiently robust models. Seventy-five percent of the compounds in the total training set were first selected at random, and 50 independent bagging samples were then generated for each model, so that each compound was used repeatedly in both the training and test sets40. The statistical performances for the training and test sets are expressed in terms of the square of the correlation coefficient (R2), the mean unsigned error (MUE) and the root mean square error (RMSE). The final equation for each model was obtained as the arithmetic average of the coefficients of the 50 individual equations.
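The bagging procedure described above can be illustrated with a minimal sketch. It is not the code used in this work: ordinary least squares stands in for the stepwise regression with the F-value stopping criterion, the function names are hypothetical, and the data are synthetic. It only shows how 50 random 75/25 splits give out-of-bag predictions, how the coefficients are averaged, and how R2, MUE and RMSE are computed.

```python
import numpy as np

def r2(measured, predicted):
    """Square of the Pearson correlation coefficient."""
    return float(np.corrcoef(measured, predicted)[0, 1] ** 2)

def mue(measured, predicted):
    """Mean unsigned error."""
    return float(np.mean(np.abs(predicted - measured)))

def rmse(measured, predicted):
    """Root mean square error."""
    return float(np.sqrt(np.mean((predicted - measured) ** 2)))

def bagged_linear_model(X, y, n_bags=50, train_fraction=0.75, seed=0):
    """Fit one least-squares model per bag and average the coefficients.

    Returns (mean_coefficients, out_of_bag_predictions). The first
    coefficient is the constant. Compounds that never fall out of bag
    receive NaN as their prediction.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    design = np.column_stack([np.ones(n), X])   # constant + descriptors
    n_train = int(round(train_fraction * n))
    coefficients = []
    oob_sum = np.zeros(n)
    oob_count = np.zeros(n)
    for _ in range(n_bags):
        order = rng.permutation(n)
        train, test = order[:n_train], order[n_train:]
        # Assumption: plain least squares replaces stepwise selection.
        beta, *_ = np.linalg.lstsq(design[train], y[train], rcond=None)
        coefficients.append(beta)
        oob_sum[test] += design[test] @ beta
        oob_count[test] += 1
    mean_coefficients = np.mean(coefficients, axis=0)
    oob_prediction = np.full(n, np.nan)
    seen = oob_count > 0
    oob_prediction[seen] = oob_sum[seen] / oob_count[seen]
    return mean_coefficients, oob_prediction

# Synthetic demonstration data (not taken from this work).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 3))
y_demo = 0.5 + X_demo @ np.array([1.0, -2.0, 0.3]) + rng.normal(scale=0.1, size=200)
coef_demo, oob_demo = bagged_linear_model(X_demo, y_demo)
mask = ~np.isnan(oob_demo)
print("averaged coefficients:", coef_demo.round(2))
print("R2 =", round(r2(y_demo[mask], oob_demo[mask]), 3),
      "MUE =", round(mue(y_demo[mask], oob_demo[mask]), 3),
      "RMSE =", round(rmse(y_demo[mask], oob_demo[mask]), 3))
```

Averaging the out-of-bag predictions per compound is one possible way of summarizing the test-set performance; the text above does not specify how the individual test predictions were combined.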

3.2.1 Solvent-Excluded Surface (SES)

The SES41 delimits the portion of the molecule that is not accessible to a solvent probe sphere as the probe moves along the molecular surface41. The SES comprises two parts, the convex and the concave surfaces. The convex surface is the contact surface of the molecule, namely the region of the van der Waals (vdW) surface that is in direct contact with the solvent probe molecules. The concave surface is the reentrant part of the molecular surface. It is formed by the inward-facing regions of the solvent probe sphere when the probe is in contact with two or more atoms simultaneously. The SES41, as shown in Figure 3.141, depends entirely on the types of atoms contained in the protein molecules and the nature of the corresponding solvent probe molecules.

Figure 3.1. The blue curve is the portion of the molecular structure that represents the vdW surface, and the red circle depicts the solvent probe molecule. The curves in red and black represent the solvent-accessible surface (SAS) and the SES, respectively. Reprinted from 41.

3.3 Results

The SIM24 approach used in the previous chapter is based on integrating a polynomial expansion of the local properties obtained from ParaSurf (including cross terms) over the molecular surface and fitting the integrals to the target property. Although this approach can give models in which the constant is small, it very often leads instead to an equation with a large constant and a local function that describes only the relatively small deviations from that constant. Interpreting this local function as a local hydrophobicity then depends strongly on the constant, which should therefore not be significant. To achieve this, a new and entirely different approach was used. The local properties and their cross-products are binned: for each of them, the maxima and minima are determined for all training compounds in a given set, and the median values of these maxima and minima are used as the outer binning thresholds. The intermediate range is divided into 10 bins of equal width, which gives 12 bins and 11 thresholds in total. With the seven local properties and 21 cross-products for AM1, MNDO, PM3, and PM6, or the six local properties and 15 cross-products for AM1* and MNDO/d, this leads to 336 or 252 new surface-bin descriptors, respectively. Each set of descriptors was used in a stepwise multiple linear regression to fit the experimental logPow values.
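The binning scheme can be sketched as follows. This is an illustration under stated assumptions rather than the ParaSurf implementation: the function names are hypothetical, the surface values are random numbers, and each bin descriptor is assumed here to be the summed area of the surface points whose value falls into the bin. The descriptor count reproduces the 336 and 252 descriptors quoted above.

```python
from math import comb

import numpy as np

def bin_thresholds(per_molecule_values, n_inner_bins=10):
    """Return the 11 thresholds that define 12 bins.

    The outer thresholds are the medians of the per-molecule minima
    and maxima; the range between them is split into equal-width bins.
    """
    lower = np.median([values.min() for values in per_molecule_values])
    upper = np.median([values.max() for values in per_molecule_values])
    return np.linspace(lower, upper, n_inner_bins + 1)

def bin_descriptors(values, areas, thresholds):
    """Sum the surface area in each bin (assumed descriptor definition)."""
    # digitize gives 0 below the first threshold and len(thresholds)
    # at or above the last one, so there are len(thresholds) + 1 bins.
    index = np.digitize(values, thresholds)
    return np.bincount(index, weights=areas, minlength=len(thresholds) + 1)

def descriptor_count(n_properties, n_bins=12):
    """Number of bin descriptors for the properties and all pairwise cross-products."""
    return (n_properties + comb(n_properties, 2)) * n_bins

# Random surface values for five hypothetical molecules.
rng = np.random.default_rng(0)
surface_values = [rng.normal(size=100) for _ in range(5)]
surface_areas = [np.full(100, 0.5) for _ in range(5)]

thresholds = bin_thresholds(surface_values)
descriptors = bin_descriptors(surface_values[0], surface_areas[0], thresholds)
print(len(thresholds), "thresholds,", len(descriptors), "bins")
print(descriptor_count(7), descriptor_count(6))
```

With seven local properties the count is (7 + 21) × 12 = 336, and with six it is (6 + 15) × 12 = 252.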

With the AM1 Hamiltonian and the SES, the full set of 10,813 compounds yields a set of descriptors that, used with the binned SIM approach, gives a stepwise multiple linear regression model with R2 (test) = 0.89, RMSE (test) = 0.58 logPow units and MUE (test) = 0.43 logPow units. Figure 3.2 shows the predicted versus measured values for the out-of-bag test set predictions.


Figure 3.2. Graphical representation of the measured versus predicted logPow values for the test set obtained with the AM1 Hamiltonian and the SES.

For the entire data set, 40 of the 336 descriptors are used in about 25 of the 50 bagging equations. In this set of descriptors, MEP × EAL and EAL × FN bins appear in each of the 50 bagging equations. Based on the sum of the absolute values of the coefficients, the most important descriptors listed in decreasing order are EAL × ENEG, HARD × ENEG, EAL × FN, IEL × EAL, and IEL. The average value of the constant is 0.15.

According to Figure 3.2, the model has some strong outliers. The SMILES strings of these outliers were collected and submitted to MOE45 for a conformational search using the MMFF94x force field. The lowest-energy geometries obtained from MOE45 were optimized, and their heats of formation were compared with those of the geometries used to generate the model. The protonation states used for model building proved to be more stable than those obtained from MOE45. The largest deviations observed for these compounds may be related either to problems with the geometry optimization or to particular problems with the AM1 Hamiltonian. The different protonation states of 1H-purine, 2,6,8-tris(methylsulfonyl)-, one of the outliers, are shown in Figure 3.3.


CORINA's protonation state: heat of formation = -89.16 kcal mol-1
MOE's protonation state: heat of formation = -68.48 kcal mol-1

Figure 3.3. Protonation states of 1H-purine, 2,6,8-tris(methylsulfonyl)- generated with CORINA and MOE.

Because of their low heats of formation, the CORINA single conformations were used to generate all of the models.

3.3.1 Conformational Dependence

In QSPR, the relationship between the prediction and the different conformations of a molecule represented as a 3D molecular structure40 remains a subject that is rarely addressed. It was shown previously for the solvation free energy models that the models obtained can reproduce the logPow values of small molecules with very good accuracy. In this study on predicting logPow, it emerged, as shown in Figure 3.4, that for compounds with rotatable bonds the prediction error depends on the number of rotatable bonds in the molecule. The numbers of rotatable bonds of the flexible compounds were calculated with MOE45 and sorted in increasing order, except for macrocyclic compounds, for which MOE45 did not designate the aliphatic carbons in the ring as rotatable.


Figure 3.4. Error of prediction obtained with AM1 and the SES versus the number of rotatable bonds.

For small numbers of rotatable bonds, the error increases with the number of rotatable bonds. The variance is larger for large numbers of rotatable bonds because only a few compounds have many rotatable bonds, which lowers the quality of the sampling. This linear growth of the deviation is closely related to the increasing uncertainty of the conformation. The compounds without significant conformational flexibility (rigid compounds) in the logPow data set used here avoid this problem and allow models with maximum performance to be generated. A plot of predicted versus measured values for the out-of-bag test set predictions obtained for the single conformation with the AM1 Hamiltonian and the SES41 is shown in Figure 3.5.
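The kind of analysis behind Figure 3.4 can be sketched as a grouping of the absolute prediction errors by the number of rotatable bonds. The function name and the six data points below are purely illustrative and are not taken from the data set; reporting the sample size per group makes the sparse sampling at high bond counts visible.

```python
from collections import defaultdict
from statistics import mean

def error_by_group(group_labels, measured, predicted):
    """Return {label: (mean absolute error, number of compounds)}."""
    buckets = defaultdict(list)
    for label, m, p in zip(group_labels, measured, predicted):
        buckets[label].append(abs(p - m))
    return {label: (mean(errors), len(errors))
            for label, errors in sorted(buckets.items())}

# Illustrative values only (hypothetical compounds).
rotatable_bonds = [0, 0, 1, 1, 2, 9]
measured_logp = [1.0, 2.0, 0.5, 1.5, 3.0, 4.0]
predicted_logp = [1.1, 1.8, 0.9, 1.2, 3.6, 2.5]

summary = error_by_group(rotatable_bonds, measured_logp, predicted_logp)
for n_bonds, (mean_error, n_compounds) in summary.items():
    print(n_bonds, round(mean_error, 2), n_compounds)
```

The same grouping applies unchanged to the number of hydrogen bond donor/acceptor atoms.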


Figure 3.5. Predicted values of logPow obtained with the AM1 Hamiltonian and the SES versus the measured ones for a test set of the rigid compounds.

Figure 3.5 shows virtually no strong outliers, and the correlation between predicted and experimental logPow is fairly good, unlike for the model obtained with the entire data set. Removing the flexible compounds increases R2 (test) by about 0.06 and decreases the RMSE (test) by about 0.12 logPow units (Table 3.1).

The statistical significance for all three models generated with the AM1 Hamiltonian and the SES for training and out-of-bag test sets is listed in Table 3.1.

Table 3.1. Statistical significance of all models generated with the AM1 Hamiltonian and the SES for the single conformation

Model | Full set MUE | Full set RMSE | Full set R2 | Flexible MUE | Flexible RMSE | Flexible R2 | Rigid MUE | Rigid RMSE | Rigid R2
Training set | 0.42 | 0.56 | 0.89 | 0.43 | 0.57 | 0.88 | 0.30 | 0.40 | 0.96
Test set | 0.43 | 0.58 | 0.89 | 0.44 | 0.59 | 0.87 | 0.33 | 0.46 | 0.95

Using only the rigid compounds yields an RMSE of 0.46 for the test set and 0.40 for the training set. With the flexible compounds, the RMSE values are 0.59 for the test set and 0.57 for the training set. The performance of the models thus improves considerably when the flexible compounds are removed from the data set. Because of the uncertainty of the conformation, the flexible compounds are responsible for an increase in the RMSE of 0.13 (test set) to 0.17 (training set) logPow units for the model generated with AM1 and the SES.

Combining the full data set, the AM1* Hamiltonian and the SES gives a set of descriptors that, used with the binned SIM approach and stepwise multiple linear regression, yields a model with R2 (test) = 0.87, RMSE (test) = 0.61 and MUE (test) = 0.46. Figure 3.6 is a plot of predicted versus measured values for the out-of-bag test set predictions obtained.

Figure 3.6. Measured values of logPow compared to the predicted ones obtained with the AM1* Hamiltonian and the SES for the test set.

Here, 39 of the 252 descriptors are used in about 25 of the 50 bagging equations, in contrast to AM1, where 40 of the 336 descriptors are used. For this set of descriptors, the MEP × FN bins appear in each of the 50 bagging equations. Based on the sum of the absolute values of the coefficients, the most important descriptors are, in decreasing order, MEP × FN, IEL × FN, FN, EAL × FN, and ENEG × FN. The average value of the constant is 0.21. For this model, three significant outliers were observed.

The performances for all models generated with the AM1* Hamiltonian for training and out-of-bag test sets are summarized in Table 3.2.

Table 3.2. Performances of all models generated with the AM1* Hamiltonian for the single conformation

Surface | Set | Full set MUE | Full set RMSE | Full set R2 | Flexible MUE | Flexible RMSE | Flexible R2 | Rigid MUE | Rigid RMSE | Rigid R2
SES | Training set | 0.45 | 0.60 | 0.88 | 0.50 | 0.61 | 0.86 | 0.32 | 0.42 | 0.96
SES | Test set | 0.46 | 0.61 | 0.87 | 0.47 | 0.62 | 0.86 | 0.35 | 0.48 | 0.94
Iso | Training set | 0.48 | 0.63 | 0.86 | 0.49 | 0.64 | 0.85 | 0.38 | 0.49 | 0.94
Iso | Test set | 0.49 | 0.65 | 0.85 | 0.50 | 0.68 | 0.83 | 0.41 | 0.55 | 0.92


The use of the SES leads to a decrease in the RMSE and an increase in the R2 for both the training and the test set in the models generated with the full data set, the flexible compounds and the rigid compounds. For the training set, the average of the RMSE is 0.54 with the SES and 0.59 with the iso. The average of the R2 is 0.90 and 0.88 with the SES and the iso, respectively, for the same training set. For the test set, the average of the RMSE is 0.57 and 0.63 when the models are generated with the SES and the iso, respectively. The test set using the SES yields an average R2 of 0.89 and an average of 0.87 when performed with the iso.
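The averages quoted in this paragraph can be checked directly against Table 3.2. The short sketch below takes the RMSE and R2 values of the full, flexible and rigid sets from the table and averages them for each surface and each set; the variable names are illustrative only.

```python
# RMSE and R2 values from Table 3.2 in the order (full, flexible, rigid).
TABLE_3_2 = {
    ("SES", "training"): {"RMSE": (0.60, 0.61, 0.42), "R2": (0.88, 0.86, 0.96)},
    ("SES", "test"): {"RMSE": (0.61, 0.62, 0.48), "R2": (0.87, 0.86, 0.94)},
    ("iso", "training"): {"RMSE": (0.63, 0.64, 0.49), "R2": (0.86, 0.85, 0.94)},
    ("iso", "test"): {"RMSE": (0.65, 0.68, 0.55), "R2": (0.85, 0.83, 0.92)},
}

def average(values):
    """Arithmetic mean of a sequence of numbers."""
    return sum(values) / len(values)

averages = {
    key: {name: round(average(values), 2) for name, values in stats.items()}
    for key, stats in TABLE_3_2.items()
}

for (surface, data_set), stats in averages.items():
    print(surface, data_set, stats)
```

The same check can be applied to the corresponding tables for the other Hamiltonians.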

The complete data set, the MNDO/d Hamiltonian and the SES provide a set of descriptors that, used with the binned SIM approach, gives a stepwise multiple linear regression model with R2 (test) = 0.86, RMSE (test) = 0.63 and MUE (test) = 0.47. The correlation between the measured and the predicted values for the out-of-bag test set predictions obtained with MNDO/d and the SES is shown in Figure 3.7.

Figure 3.7. Correlation between the measured and the predicted values of logPow for the test set obtained with the MNDO/d Hamiltonian and the SES.

For this model, 55 of the 252 descriptors are used in about 25 of the 50 bagging equations. Compared with AM1*, considerably more of the binned descriptors are used for MNDO/d. Among this set of binned descriptors, the MEP × FN, MEP × ENEG, FN, and HARD × ENEG bins are present in each of the 50 bagging equations. Based on the sum of the absolute values of the coefficients, the most important descriptors are, in decreasing order, MEP × FN, FN, HARD × FN, IEL × FN, and ENEG × FN. The average value of the constant is 31.2 × 10^-2. In this case, about five points lie far from the line y = ax (a = 1).


The statistical details about the performances of all models generated with the MNDO/d Hamiltonian for training and out-of-bag test sets are given in Table 3.3.

Table 3.3. Statistical details on the performances of all models generated with the MNDO/d Hamiltonian for the single conformation

Surface | Set | Full set MUE | Full set RMSE | Full set R2 | Flexible MUE | Flexible RMSE | Flexible R2 | Rigid MUE | Rigid RMSE | Rigid R2
SES | Training set | 0.46 | 0.61 | 0.87 | 0.48 | 0.63 | 0.85 | 0.33 | 0.44 | 0.95
SES | Test set | 0.47 | 0.63 | 0.86 | 0.49 | 0.64 | 0.85 | 0.36 | 0.50 | 0.94
Iso | Training set | 0.49 | 0.65 | 0.85 | 0.50 | 0.66 | 0.84 | 0.36 | 0.49 | 0.94
Iso | Test set | 0.50 | 0.67 | 0.85 | 0.51 | 0.68 | 0.83 | 0.40 | 0.54 | 0.93

As seen previously with AM1*, the use of the SES leads to a decrease in the RMSE and an increase in the R2 for both the training and the test sets of the models generated with the full data set, the flexible compounds and the rigid compounds. For the training set, the average of the RMSE is 0.56 with the SES and 0.60 with the iso. The average of the R2 is 0.89 and 0.88 with the SES and the iso, respectively, for the same training set. For the test set the average of the RMSE is 0.59 and 0.63 when the models are generated with the SES and the iso, respectively. For the test set, using the SES leads to an average R2 of 0.88 and an average of 0.87 when generating the model with the iso.

With the full data set (flexible and rigid compounds), the MNDO Hamiltonian and the SES give a set of descriptors that, used with the binned SIM approach, yields a stepwise multiple linear regression model with R2 (test) = 0.88, RMSE (test) = 0.60 and MUE (test) = 0.46. The predicted logPow values for the out-of-bag test set are plotted against the measured values in Figure 3.8.


Figure 3.8. Plot of the predicted against the measured values of logPow for the test set obtained with the MNDO Hamiltonian and the SES.

In this case, 51 of the 336 descriptors are used in about 25 of the 50 bagging equations. For this set of binned descriptors, the MEP × FN, MEP × EAL, EAL × POL, and MEP × ENEG bins appear in each of the 50 bagging equations. Based on the sum of the absolute values of the coefficients, the most important descriptors are, in decreasing order, HARD × FN, MEP × EAL, MEP × FN, MEP × ENEG, and IEL × ENEG. The average value of the constant is 44.5 × 10^-2. About four points are outliers here.

Table 3.4 gives the statistics of all models generated with the MNDO Hamiltonian for training and out-of-bag test sets.

Table 3.4. Statistical description of all models generated with the MNDO Hamiltonian for the single conformation

Surface | Set | Full set MUE | Full set RMSE | Full set R2 | Flexible MUE | Flexible RMSE | Flexible R2 | Rigid MUE | Rigid RMSE | Rigid R2
SES | Training set | 0.44 | 0.58 | 0.88 | 0.46 | 0.60 | 0.87 | 0.32 | 0.43 | 0.95
SES | Test set | 0.46 | 0.60 | 0.88 | 0.47 | 0.62 | 0.86 | 0.35 | 0.48 | 0.94
Iso | Training set | 0.46 | 0.61 | 0.87 | 0.48 | 0.63 | 0.85 | 0.34 | 0.45 | 0.95
Iso | Test set | 0.48 | 0.63 | 0.86 | 0.49 | 0.65 | 0.84 | 0.37 | 0.50 | 0.94

With the SES there is a decrease in the RMSE and an increase in the R2 for both the training and the test sets of the models generated with the full data set and the flexible compounds. There is no change in the R2 for the rigid compounds. For the training set, the average of the RMSE is 0.54 with the SES and 0.56 with the iso. The average of the R2 is 0.90 and 0.89 with the SES and the iso, respectively, for the same training set. For the test set, the average of the RMSE is 0.57 and 0.59 when the models are generated with the SES and the iso, respectively. The test set gives an average R2 of 0.89 with the SES and 0.88 with the iso.

The entire data set with the PM3 Hamiltonian and the SES provides a set of descriptors that, used with the binned SIM approach, leads to a stepwise multiple linear regression model with R2 (test) = 0.87, RMSE (test) = 0.60 and MUE (test) = 0.46. The correlation between the measured and the predicted values for the out-of-bag test set predictions obtained with PM3 and the SES is shown in Figure 3.9.

Figure 3.9. Comparison of measured and predicted values of logPow for the test set obtained with the PM3 Hamiltonian and the SES.

The use of the entire data set, the PM3 Hamiltonian and the SES yields a model in which 45 of the 336 descriptors are used in about 25 of the 50 bagging equations. In this case, the POL × FN, IEL × FN, and FN bins appear in each of the 50 bagging equations. Averaging the absolute values of the coefficients gives, in decreasing order, IEL × FN, FN, HARD × FN, POL × FN, and EAL × ENEG as the most important descriptors. The average value of the constant is 49.5 × 10^-2. One point lies somewhat apart from the others, which are close together, and is an outlier.

Table 3.5 lists all statistical values for three models generated with the PM3 Hamiltonian for training and out-of-bag test sets.


Table 3.5. Predictive powers of all models generated with the PM3 Hamiltonian for the single conformation

Surface | Set | Full set MUE | Full set RMSE | Full set R2 | Flexible MUE | Flexible RMSE | Flexible R2 | Rigid MUE | Rigid RMSE | Rigid R2
SES | Training set | 0.45 | 0.59 | 0.88 | 0.46 | 0.59 | 0.87 | 0.34 | 0.45 | 0.95
SES | Test set | 0.46 | 0.60 | 0.87 | 0.47 | 0.61 | 0.86 | 0.37 | 0.51 | 0.94
Iso | Training set | 0.50 | 0.66 | 0.85 | 0.51 | 0.67 | 0.83 | 0.40 | 0.52 | 0.93
Iso | Test set | 0.52 | 0.68 | 0.84 | 0.53 | 0.69 | 0.82 | 0.44 | 0.59 | 0.92

Generating the models with the SES leads to a decrease in the RMSE and an increase in the R2 for both the training and the test set for the full data set, the flexible compounds and the rigid compounds. For the training set, the average of the RMSE is 0.54 with the SES and 0.62 with the iso. The average of the R2 is 0.90 and 0.87 with the SES and the iso, respectively, for the same training set. For the test set, the average of the RMSE is 0.57 and 0.65 when the models are generated with the SES and the iso, respectively. Using the SES helps in obtaining an average R2 of 0.89 for the test set and an average of 0.86 when performed with the iso for the same test set.

The total data set with the PM6 Hamiltonian and the SES gives a set of descriptors that, used with the binned SIM approach, produces a stepwise multiple linear regression model with R2 (test) = 0.86, RMSE (test) = 0.64 and MUE (test) = 0.47. The regression of the measured logPow against the predicted values obtained for the out-of-bag test set is shown in Figure 3.10.

Figure 3.10. Regression of measured logPow against predicted values obtained for the test set with the PM6 Hamiltonian and the SES.


Using the set of 10,813 compounds gives a model in which 43 of the 336 descriptors are used in about 25 of the 50 bagging equations. For this set of binned descriptors, the IEL × HARD and IEL × ENEG bins appear in each of the 50 bagging equations. The sum of the absolute values of the coefficients ranks the descriptors EAL × ENEG, IEL × ENEG, IEL × EAL, MEP × FN, and FN as the most important, in decreasing order. The average value of the constant is -0.17. In this case, three points lie very far from the others; they are significant outliers.

Table 3.6 is a summary of the performances of all models generated with the PM6 Hamiltonian for training and out-of-bag test sets.

Table 3.6. Summary of the performances of all models generated with the PM6 Hamiltonian for the single conformation

Surface | Set | Full set MUE | Full set RMSE | Full set R2 | Flexible MUE | Flexible RMSE | Flexible R2 | Rigid MUE | Rigid RMSE | Rigid R2
SES | Training set | 0.46 | 0.61 | 0.87 | 0.47 | 0.62 | 0.86 | 0.36 | 0.48 | 0.94
SES | Test set | 0.47 | 0.64 | 0.86 | 0.48 | 0.64 | 0.85 | 0.38 | 0.52 | 0.93
Iso | Training set | 0.48 | 0.63 | 0.86 | 0.49 | 0.64 | 0.85 | 0.37 | 0.50 | 0.94
Iso | Test set | 0.49 | 0.66 | 0.85 | 0.50 | 0.66 | 0.84 | 0.41 | 0.55 | 0.93

Model generation realized with the SES yields a decrease in the RMSE and an increase in the R2 for both the training and the test sets of the models generated either with the full data set or the flexible compounds. There is no change in the R2 for the rigid compounds. For the training set, the average of the RMSE is 0.57 with the SES and 0.59 with the iso. The average of the R2 is 0.89 and 0.88 with the SES and the iso, respectively, for the same training set. For the test set the average of the RMSE is 0.60 and 0.62 when the models are generated with the SES and the iso, respectively. With the test set, R2 averages of 0.88 and 0.87 are obtained for the SES and the iso, respectively.

In chemistry, as in physics, two atoms interact in a way that may be attractive or repulsive, depending on the nature of the atoms. Thus, a hydrogen atom in the presence of an electronegative atom, such as oxygen, nitrogen, or fluorine, from another molecule or chemical group forms an attractive interaction with it, commonly called hydrogen bonding. An electronegative atom, whether bonded to a hydrogen atom or not, can always interact with another hydrogen atom and is therefore always a hydrogen bond acceptor, whereas a hydrogen atom attached to an electronegative atom acts as a hydrogen bond donor52. Hydrogen bonding strongly affects the logPow prediction, as does the number of rotatable bonds.

In Figure 3.11, the error of prediction is related to the number of hydrogen bond donor/acceptors.


Figure 3.11. Correlation between the error of prediction obtained with AM1 and the SES and the number of hydrogen bond donor/acceptor atoms obtained with MOE.

For the complete data set, the standard deviation rises essentially linearly with the number of hydrogen bond donor/acceptor atoms, so the two quantities are correlated.

The SMILES strings of the LOGKOW database35 were compared to those of the data46 used as a test set in the previous chapter, stored in PubChem or ChemSpider. Forty-eight compounds found in both data sets were removed from the external validation data set. A remaining set of 320 compounds was then used for the external validation process. The performances of all 11 models generated with the full data set on the external validation set are shown in Table 3.7.
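The removal of overlapping compounds can be sketched as a set comparison. This is an assumption about the mechanics, not the procedure actually used: exact string matching only works if both sets hold SMILES strings canonicalized in the same way, and the function name and the toy strings below are hypothetical.

```python
def remove_overlap(validation_smiles, training_smiles):
    """Drop validation entries that also occur in the training data.

    Assumes both lists contain SMILES strings canonicalized identically,
    so that equal structures give equal strings.
    """
    training = set(training_smiles)
    kept = [smiles for smiles in validation_smiles if smiles not in training]
    return kept, len(validation_smiles) - len(kept)

# Toy example with hypothetical entries.
validation = ["CCO", "c1ccccc1", "CC(=O)O"]
training = ["CCO", "CCN"]
kept, n_removed = remove_overlap(validation, training)
print(kept, n_removed)

# Bookkeeping for the numbers given in the text.
remaining = 368 - 48
print(remaining)
```

The bookkeeping line reproduces the 320 compounds that remain after the 48 shared compounds are removed from the 368.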

Table 3.7. Performances of all models generated with the full set on the external validation set for the single conformation

Model | MUE | RMSE | R2
AM1 (SES) | 0.35 | 0.49 | 0.89
AM1* (iso) | 0.41 | 0.55 | 0.86
AM1* (SES) | 0.40 | 0.55 | 0.86
MNDO/d (iso) | 0.36 | 0.51 | 0.87
MNDO/d (SES) | 0.36 | 0.50 | 0.88
MNDO (iso) | 0.37 | 0.53 | 0.86
MNDO (SES) | 0.36 | 0.51 | 0.88
PM3 (iso) | 0.42 | 0.59 | 0.83
PM3 (SES) | 0.36 | 0.51 | 0.88
PM6 (iso) | 0.40 | 0.55 | 0.85
PM6 (SES) | 0.41 | 0.57 | 0.85

The best model obtained with the binned SIM approach gives an R2 of 0.89, compared with an R2 of 0.84 for the best polynomial SIM model. The new binned SIM approach not only helps to describe a true local hydrophobicity and allows a test set that can be as large as the training set, but also gives models whose statistical performances are better than those generated with the old polynomial SIM approach.

3.3.2 Comparison with Publicly Available logPow Models

The full, flexible and rigid data sets were also evaluated with other publicly available logPow prediction tools: AlogPS2.153 and the SlogP and logP_o/w models available in MOE45. The performances are summarized in Table 3.8.

Table 3.8. Performances of other publicly available logPow models on our full, flexible and rigid sets for the single conformation

Column groups: full test (validation) set (10,813 compounds); test or validation set for flexible compounds (9241 compounds); test or validation set for rigid compounds (1572 compounds).

Model | Full set MUE | Full set RMSE | Full set R2 | Flexible MUE | Flexible RMSE | Flexible R2 | Rigid MUE | Rigid RMSE | Rigid R2
AlogPS2.1 | 0.30 | 0.43 | 0.94 | 0.30 | 0.44 | 0.93 | 0.27 | 0.40 | 0.96
SlogP | 0.55 | 0.76 | 0.80 | 0.56 | 0.76 | 0.79 | 0.52 | 0.72 | 0.88
logP_o/w | 0.56 | 0.82 | 0.79 | 0.58 | 0.84 | 0.77 | 0.45 | 0.66 | 0.88

The publicly available models AlogPS2.1, logP_o/w and SlogP differ strikingly in performance. The logP_o/w and SlogP models from MOE45 give similar R2 values, whereas AlogPS2.1 gives very good R2 values. One disadvantage of the AlogPS2.1 model is that 122 of the 10,813 compounds in our data set could not be calculated because their SMILES strings were not accepted.

3.3.3 Variable Importance

The variable importance gives an idea about the effect or particular impact of a descriptor in a specific model. It is obtained by the arithmetic addition of the absolute values of the coefficients of the descriptors ascertained from a given property or a cross-product of two properties.
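As an illustration of this definition, the sketch below sums the absolute values of the bin coefficients belonging to each property or cross-product. The function name, the keying of the coefficients by (descriptor, bin index) and the coefficient values are assumptions made for the example; they are not values from the models.

```python
from collections import defaultdict

def variable_importance(bin_coefficients):
    """Sum |coefficient| over all bins of each property or cross-product.

    bin_coefficients maps (descriptor_name, bin_index) to a regression
    coefficient. The result is ordered by decreasing importance.
    """
    importance = defaultdict(float)
    for (descriptor, _bin_index), coefficient in bin_coefficients.items():
        importance[descriptor] += abs(coefficient)
    return dict(sorted(importance.items(), key=lambda item: item[1], reverse=True))

# Hypothetical coefficients for a handful of bins.
example_coefficients = {
    ("EAL x ENEG", 0): 0.4,
    ("EAL x ENEG", 5): -0.7,
    ("IEL", 2): 0.3,
    ("MEP", 1): -0.05,
}

ranking = variable_importance(example_coefficients)
for descriptor, value in ranking.items():
    print(descriptor, round(value, 2))
```

Because absolute values are summed, bins with large positive and large negative coefficients both raise the importance of their parent descriptor instead of cancelling.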

The variable importance obtained for the model generated with the full data set, the AM1 Hamiltonian and the SES is shown in Figure 3.12.


Figure 3.12. Importance of the different basic descriptors as the arithmetic addition of the absolute values of the corresponding bin coefficients obtained with the AM1 Hamiltonian and the SES.

The EAL, ENEG, IEL, HARD, and FN as well as the cross-products between them are the most important descriptors. No single local property dominates, but the MEP does not play a significant role.

Figure 3.13 is a graphical representation of the variable importance obtained for a model generated with the full data set, the AM1* Hamiltonian and the SES.

Figure 3.13. Quantitative significance of the role played by the different basic descriptors obtained as the arithmetic addition of the absolute values of the corresponding bin coefficients for the AM1* Hamiltonian and the SES.


The MEP, FN, IEL, EAL, and ENEG as well as the cross-products between them are the most important descriptors. No single local property dominates, but the local hardness does not play a significant role.

3.3.4 Variable Dependence

The dependence of the predicted AM1 and AM1* logPow values on the input descriptors was investigated by systematically changing one of them, the FN. The compounds selected for this purpose are listed in Tables 3.9 and 3.10. They are flexible compounds with one to four rotatable bonds. Changing the calculated value of the integrated absolute FN over the surface (|F(N)|), the integrated FN over the surface for all negative values (F(N – ve)), or the integrated FN over the surface for all positive values (F(N + ve)) has a dominant influence on the predicted logPow. Table 3.9 shows the dependence of the calculated logPow on the F(N – ve) obtained with the AM1 Hamiltonian and the SES41.

Table 3.9. Dependence of logPow prediction on the F(N – ve) for AM1 and the SES

Compound | Exp. logP | F(N – ve) 1 (kcal Angstrom mol-1) | logP1 | F(N – ve) 2 (kcal Angstrom mol-1) | logP2 | Number of rotatable bonds
Benzonitrile, 2,6-dimethyl-, N-oxide | 2.74 | -228.5 | 9.84 | -1319 | 7.32 | 1
Dibenzo[b,z][1,4,7,10,13,16,19,22,25,28,31,34,37,40,43,46]hexadecaoxacyclooctatetracontin, 6,7,9,10,12,13,15,16,18,19,21,22,24,25,32,33,35,36,38,39,41,42,44,45,47,48,50,51-octacosahydro- | 0.52 | -731.2 | 5.15 | -7172 | 3.95 | -
Dibenzo[b,q][1,4,7,10,13,16,19,22,25,28]decaoxacyclotriacontin, 2,20-bis(1,1-dimethylethyl)-6,7,9,10,12,13,15,16,23,24,26,27,29,30,32,33-hexadecahydro- | 3.32 | -460.0 | 7.45 | -4779 | 6.66 | -
Aziridine, 1,1',1''-phosphinothioylidynetris- | 0.53 | -1432 | -4.30 | -3142 | -2.37 | 3
1H-purine, 2,6,8-tris(methylsulfonyl)- | 3.58 | -561.3 | -0.81 | -5476 | -0.60 | 3
Sulfur, pentafluorophenyl- | 3.36 | -407.3 | 3.80 | -1343 | 3.63 | 1
Acetic acid, 3-(pentafluorothio)phenoxy- | 2.78 | -475.8 | 3.40 | -2554 | 3.02 | 4


Table 3.10 gives the dependence of the predicted logPow on the F(N – ve) when the AM1* Hamiltonian and the SES41 are used.

Table 3.10. Dependence of logPow prediction on the F(N – ve) for AM1* and the SES

Compound | Exp. logP | F(N – ve) 1 (kcal Angstrom mol-1) | logP1 | F(N – ve) 2 (kcal Angstrom mol-1) | logP2 | Number of rotatable bonds
Benzonitrile, 2,6-dimethyl-, N-oxide | 2.74 | -237.9 | 3.13 | -1320 | 2.29 | 1
Dibenzo[b,z][1,4,7,10,13,16,19,22,25,28,31,34,37,40,43,46]hexadecaoxacyclooctatetracontin, 6,7,9,10,12,13,15,16,18,19,21,22,24,25,32,33,35,36,38,39,41,42,44,45,47,48,50,51-octacosahydro- | 0.52 | -739.9 | 5.88 | -7282 | 4.41 | -
Dibenzo[b,q][1,4,7,10,13,16,19,22,25,28]decaoxacyclotriacontin, 2,20-bis(1,1-dimethylethyl)-6,7,9,10,12,13,15,16,23,24,26,27,29,30,32,33-hexadecahydro- | 3.32 | -472.3 | 7.93 | -4788 | 7.01 | -
Aziridine, 1,1',1''-phosphinothioylidynetris- | 0.53 | -1042 | -1.40 | -2315 | -1.12 | 3
1H-purine, 2,6,8-tris(methylsulfonyl)- | 3.58 | -470.6 | 0.88 | -4886 | -0.31 | 3
Sulfur, pentafluorophenyl- | 3.36 | -414.4 | 6.62 | -1312 | 1.98 | 1
Acetic acid, 3-(pentafluorothio)phenoxy- | 2.78 | -470 | 7.24 | -2514 | 3.24 | 4

These results show a direct link between the change of a parameter and the predicted logPow. Carbon substitution, in general, increases hydrophobicity, whereas heteroatoms decrease it; decreasing the value of the F(N – ve) improves the logPow prediction and reduces the errors.

In the preceding tables, compounds without a value for the number of rotatable bonds are macrocycles for which MOE45 did not assign one. Figures 3.14 and 3.15 show some of the compounds listed in these tables.


Figure 3.14. (a) Sulfur, pentafluorophenyl- (b) Aziridine, 1,1',1''-phosphinothioylidynetris-.

Figure 3.15. (a) Acetic acid, 3-(pentafluorothio)phenoxy- (b) Benzonitrile, 2,6-dimethyl-, N-oxide.

Hydrogen bonding is a key parameter for assessing possible three-dimensional structures of proteins and nucleic bases. Very often in these macromolecules, bonds form between different parts of the same molecule. Such internal bonds allow the molecule to adopt a specific conformation that is favorable to its biochemical or physiological role. The FN, which appears as one of the most important descriptors, is closely related to hydrogen bonding and can consequently affect the logPow prediction. In Figure 3.16, the |F(N)| is related to the number of hydrogen bond donor/acceptors.


Figure 3.16. Average value of the |F(N)| obtained with AM1 and the SES versus the number of hydrogen bond donor/acceptor atoms.

For the full data set, a strong relationship is observed between the value of the |F(N)| and the number of hydrogen bond donor/acceptor atoms: the |F(N)| increases as the number of hydrogen bond donor/acceptor atoms increases.

3.4 Discussion

The main objective of this study was to generate logPow models, based mainly on the true local hydrophobicity, that can accurately predict the logPow values of a structurally diverse set of compounds. Our investigation established that some physical properties, such as the number of rotatable bonds and the number of hydrogen bond donor/acceptor atoms (Figures 3.4 and 3.11), strongly affect the logPow prediction. LogPow values for compounds with a large number of rotatable bonds or a large number of hydrogen bond donor/acceptor atoms are poorly estimated. For the special case of the AM1 Hamiltonian and the SES41 (solvex), the R2 of the model obtained from the flexible compounds is about eight times lower than that of the model generated with the rigid compounds, which suggests considerable uncertainty about the conformations of the flexible compounds. In general, models generated from the rigid compounds, which have restricted conformations, show very good statistical performance, unlike those obtained with the flexible compounds. A likely reason is that the real conformations of the flexible compounds are not known, and the single conformation used was derived from a 2D-to-3D conversion with CORINA44. Using an average conformation for these flexible compounds, or other statistical approaches, may help to improve the predictive power of the models generated with the flexible compounds.

Another important aspect is the relationship between the number of hydrogen bond donor/acceptors of some particular compounds and the RMSE, which increase linearly relative to one another (Figure 3.11). This may reflect the specific nature of octanol, which is itself a hydrogen bond donor/acceptor compound28. This unfavorable similarity20 between octanol and compounds containing a large number of hydrogen bond donor/acceptors leads to a repulsive interaction between these compounds and octanol. Therefore, compounds with a high number of hydrogen bond donor/acceptor atoms tend to be less hydrophobic.

For the solvation models, the histogram of the RMSD is dominated by the high percentages of AM1* and MNDO/d compared to AM1 and PM3. This does not seem to be the case for the logPow models. Figure 3.17 gives a graphical representation of the RMSE for the logPow models generated with the total data set for AM1, AM1*, MNDO/d, MNDO, PM3, and PM6.

Figure 3.17. Hamiltonian versus RMSE for the total data set.

According to the histogram above, the RMSE percentages, in decreasing order, are those for PM3(iso), MNDO/d(iso), PM6(iso), AM1*(iso), PM6(solvex), MNDO(iso), MNDO/d(solvex), AM1*(solvex), MNDO(solvex), PM3(solvex), and finally AM1(solvex). The use of the FN seems to compensate for the lack of the POL for AM1* and MNDO/d, which was strongly manifested for the solvation models, and thus plays a significant role in predictions with these Hamiltonians. As noted in the comments on Figures 3.6 and 3.7, the MEP × FN bin appears in all 50 bagging equations and is first in the list of the most important descriptors, selected according to the sum of the absolute values of the coefficients, for AM1* and MNDO/d. FN also appears on its own and in all cross terms for these two Hamiltonians. The effect of the molecular surface on the logPow prediction is also apparent: all the models generated with the SES have lower RMSE percentages than those obtained with the iso for the same Hamiltonian. The binned SIM approach is entirely a surface-dependent method; the use of the SES always increases the predictive power of the model generated, thus decreasing the value of the RMSE.
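The ranking of descriptors over a set of bagging equations, by how often each descriptor appears and by the sum of the absolute values of its coefficients, can be sketched as follows. This is only an illustration of the bookkeeping: the function `rank_descriptors`, the descriptor names, and the coefficient values are hypothetical, and three toy equations stand in for the 50 bagging equations.

```python
from collections import defaultdict


def rank_descriptors(equations):
    """Rank descriptors over a list of regression equations.

    `equations` is a list of dicts mapping descriptor name -> coefficient,
    one dict per bagging equation. Returns a list of tuples
    (name, number of equations containing it, sum of |coefficient|),
    sorted by the summed absolute coefficient in decreasing order.
    """
    counts = defaultdict(int)
    abs_sums = defaultdict(float)
    for equation in equations:
        for name, coefficient in equation.items():
            counts[name] += 1
            abs_sums[name] += abs(coefficient)
    return sorted(
        ((name, counts[name], abs_sums[name]) for name in counts),
        key=lambda row: row[2],
        reverse=True,
    )


# Hypothetical toy data: three equations instead of 50.
toy_equations = [
    {"MEPxFN_bin1": -0.8, "FN_bin3": 0.2},
    {"MEPxFN_bin1": -0.7, "POL_bin2": 0.1},
    {"MEPxFN_bin1": -0.9, "FN_bin3": 0.3},
]

ranking = rank_descriptors(toy_equations)
for name, n_equations, total in ranking:
    print(f"{name}: in {n_equations} of {len(toy_equations)} equations, "
          f"sum of |coefficients| = {total:.2f}")
```

In this toy example the cross-term descriptor appears in every equation and has the largest summed absolute coefficient, which is the pattern described above for the MEP × FN bin.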

Among the Hamiltonians containing the POL, AM1, with 40 of the 336 descriptors used (11.90%), seems to be the one using the fewest descriptors in 25 of the 50 bagging equations, in contrast to MNDO, PM3, and PM6 with 51 (15.18%), 45 (13.39%), and 43 (12.80%), respectively. For MNDO/d and AM1*, 55 (21.83%) and 39 (15.48%), respectively, of the 252 binned descriptors are used when the model is generated with the full data set and the SES. The major differences observed in the sets of binned descriptors used can be due to the different parameterization approaches used for these Hamiltonians.
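The percentages quoted above follow directly from the descriptor counts. As a quick check, the short sketch below recomputes them; the counts are taken from the text, while the dictionary layout and the function name `percent_used` are chosen here for illustration.

```python
# (descriptors used, descriptors available) as quoted in the text.
descriptor_counts = {
    "AM1": (40, 336),
    "MNDO": (51, 336),
    "PM3": (45, 336),
    "PM6": (43, 336),
    "MNDO/d": (55, 252),
    "AM1*": (39, 252),
}


def percent_used(n_used, n_available):
    """Percentage of the available descriptors that are used."""
    return 100.0 * n_used / n_available


percentages = {
    hamiltonian: percent_used(n_used, n_available)
    for hamiltonian, (n_used, n_available) in descriptor_counts.items()
}

for hamiltonian, value in percentages.items():
    n_used, n_available = descriptor_counts[hamiltonian]
    print(f"{hamiltonian:7s} {n_used:3d} of {n_available} = {value:.2f}%")
```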

The constants for the models obtained with the SES and our different Hamiltonians are 0.15, 0.21, 0.023, 0.054, 0.055, and -0.17. All of these values are very close to zero, suggesting an equal probability of distribution in water and octanol when the compound is presumed to have no surface. This supports the hypothesis that a function closely related to a true local hydrophobicity can effectively be integrated into a SIM, yielding models that represent a true local hydrophobicity.
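To see what a constant close to zero means on the concentration scale, recall that logPow is the base-10 logarithm of the octanol/water concentration ratio, so a value of zero corresponds to a ratio of one. The sketch below converts the six constants quoted above into ratios. It is a simple illustration, assuming only this definition; the variable names are chosen here, and the text does not state which constant belongs to which Hamiltonian.

```python
# Model constants quoted in the text (order as given there).
constants = [0.15, 0.21, 0.023, 0.054, 0.055, -0.17]

# logP = log10(c_octanol / c_water), hence c_octanol / c_water = 10 ** logP.
ratios = [10.0 ** c for c in constants]

for c, ratio in zip(constants, ratios):
    print(f"constant = {c:6.3f}  ->  octanol/water ratio = {ratio:.2f}")
```

All six ratios lie within a factor of two of unity, which is the sense in which the constants indicate a nearly equal distribution between the two phases.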

For comparison, universal logP models were applied to the total data set of the LOGKOW database35. Except for AlogPs, which with an R2 of 0.94 performs better than the binned SIM logPow models developed here, the universal models perform worse: SlogP and logPo/w have R2 values of 0.80 and 0.79, respectively, compared with 0.84 to 0.89 for the binned SIM logPow models. The large difference in performance between AlogPs and our models may be due to the 122 molecules that were not taken into account in the evaluation of AlogPs. However, the largest absolute error is lower for our models, confirming the robustness of our binned SIMs. One common difficulty is that it is not possible to know whether some compounds present in the LOGKOW database35 are also present in the data sets used to generate the logPo/w, SlogP, and AlogPs models. Therefore, the comparison made here cannot be considered a true comparison.

The compounds used for model generation are completely different from those in the validation set, so the validation of the models obtained with the binned SIM approach can be regarded as a real validation.

The FN, as one of the most important descriptors, was analyzed in order to detect a correlation with the conformation of the molecule. A close relationship between FN and the number of hydrogen bond donor/acceptor atoms was observed, as shown in Figure 3.16, suggesting the hypothesis that one of these parameters is sufficient to influence the logPow prediction. The FN within a molecule depends strongly on the molecule’s conformation.

For the models generated, some compounds show specific behaviors with respect to some Hamiltonians, including, among others, lying markedly far from the other points of the curve. These behaviors may be due to certain incompatibilities between the molecule and some parameters related to VAMP43, or to characteristics or properties specific to the molecule.

1H-purine, 2,6,8-tris(methylsulfonyl)- (Figure 3.18) is an outlier for all the Hamiltonians. After the optimization of the geometry, the two CH3-S-O2 groups are no longer bonded to the central atom group as shown below. This compound has a problem with geometry optimization in VAMP43 and therefore its logPow value is always poorly reproduced. This problem can be solved by doing a single point calculation instead of a full optimization.


Figure 3.18. 1H-purine, 2,6,8-tris(methylsulfonyl)- before optimization and after optimization.

1H-imidazole, 2,4,5-triiodo- (Figure 3.19a) is the strongest outlier for the AM1 Hamiltonian when the descriptors are calculated with the SES. This compound contains three iodine atoms. Iodine is non-polar and therefore tends to be less soluble in octanol and water, which are polar solvents. The poor prediction can also be due to a possible interaction between the NH group and the iodine atom.

Phenol, 2,6-bis(1-methylpropyl)-4-nitro-, [S-(R*,R*)]- (Figure 3.19b) is the strongest outlier when the geometry is optimized with the PM6 Hamiltonian and the descriptors are calculated with the SES. In this compound, the nitro group is attached to the benzene ring in the para position, and because of its strong attraction for electrons, it delocalizes the π-electrons of the ring to satisfy its charge deficiency. Moreover, this para position of the nitro group results in two adjacent carbons that would carry positive charges, which is an unfavorable situation. These unwanted effects seem to be strongly manifested with the PM6 Hamiltonian, and all of these factors destabilize the compound. Therefore, its logPow is poorly reproduced.


Figure 3.19. (a) 1H-imidazole, 2,4,5-triiodo- (b) Phenol, 2,6-bis(1-methylpropyl)-4-nitro-, [S-(R*,R*)]-.

Methane, tetrabromo- (Figure 3.20a) is an outlier for the PM6 Hamiltonian. This can be due either to an optimization problem with PM6 for this compound or to the parameterization of bromine in the PM6 Hamiltonian.

Dibenzo[b,z][1,4,7,10,13,16,19,22,25,28,31,34,37,40,43,46]hexadecaoxacyclooctatetracontin, 6,7,9,10,12,13,15,16,18,19,21,22,24,25,32,33,35,36,38,39,41,42,44,45,47,48,50,51-octacosahydro- (Figure 3.20b), which is a macrocycle, is an outlier for all the Hamiltonians because of its flexibility. In addition, its logPow value was measured under conditions that are not defined.

Figure 3.20. (a) Methane, tetrabromo- (b) Dibenzo[b,z][1,4,7,10,13,16,19,22,25,28,31,34,37,40,43,46]hexadecaoxacyclooctatetracontin, 6,7,9,10,12,13,15,16,18,19,21,22,24,25,32,33,35,36,38,39,41,42,44,45,47,48,50,51-octacosahydro-.

The macrocycle 1,4,7,10,13,16,19,22,25,28,31-benzundecaoxacyclotritriacontin, 2,3,5,6,8,9,11,12,14,15,17,18,20,21,23,24,26,27,29,30-eicosahydro- (Figure 3.21a) is the strongest outlier for MNDO and MNDO/d. This compound probably has an optimization problem with the MNDO and MNDO/d Hamiltonians.

Dibenzo[b,q][1,4,7,10,13,16,19,22,25,28]decaoxacyclotriacontin, 2,20-bis(1,1-dimethylethyl)-6,7,9,10,12,13,15,16,23,24,26,27,29,30,32,33-hexadecahydro- (Figure 3.21b) is also a macrocycle. It is the strongest outlier for all the Hamiltonians because of its flexibility and probably because its experimental logPow value was obtained under conditions that are not defined.

Figure 3.21. (a) 1,4,7,10,13,16,19,22,25,28,31-Benzundecaoxacyclotritriacontin, 2,3,5,6,8,9,11,12,14,15,17,18,20,21,23,24,26,27,29,30-eicosahydro- (b) Dibenzo[b,q][1,4,7,10,13,16,19,22,25,28]decaoxacyclotriacontin, 2,20-bis(1,1-dimethylethyl)-6,7,9,10,12,13,15,16,23,24,26,27,29,30,32,33-hexadecahydro-.

For AM1*, 1,3,5-triazine, 2,4,6-tris(trichloromethyl)- (Figure 3.22a) is an outlier, probably because of the presence of three CCl3 groups, which possibly destabilized the compound during the optimization.

For PM6, 1,3-benzenedicarboxamide, 5-(acetylmethylamino)-N,N'-bis(2,3-dihydroxypropyl)-2,4,6-triiodo- (Figure 3.22b) is an outlier. This compound contains three iodine atoms and probably has an optimization problem with the PM6 Hamiltonian. Iodine is the most electropositive halogen and, in the presence of a polar solvent, tends to form a charge-transfer complex with it. The high electron density on the iodine atom can also be responsible for the poor reproducibility of the logPow of compounds containing iodine. Another important aspect, as mentioned previously, can be the possible interaction between the iodine atom and an NH group.


Figure 3.22. (a) 1,3,5-Triazine, 2,4,6-tris(trichloromethyl)- (b) 1,3-Benzenedicarboxamide, 5-(acetylmethylamino)-N,N'-bis(2,3-dihydroxypropyl)-2,4,6-triiodo-.

1H-imidazole, 2-nitro-1-(2,3,5-tri-O-benzoyl-β-D-ribofuranosyl)- (Figure 3.23) is an outlier for all the Hamiltonians because of its flexibility and because its experimental logPow value was obtained under undefined conditions.

Figure 3.23. 1H-imidazole, 2-nitro-1-(2,3,5-tri-O-benzoyl-β-D-ribofuranosyl)-.

The hydrophobic surfaces obtained from MOE45 for 1,3,5-triazine, 2,4,6-tris(trichloromethyl)- and 1H-purine, 2,6,8-tris(methylsulfonyl)-, with geometries optimized with AM1* and descriptors calculated with the iso, are shown below (Figure 3.24):


1,3,5-Triazine, 2,4,6-tris(trichloromethyl)-: logP(exp) 3.75, logP(calc) 8.66

1H-purine, 2,6,8-tris(methylsulfonyl)-: logP(exp) 3.58, logP(calc) 0.32

Figure 3.24. Schematic view of the contribution of local surface areas for 1,3,5-triazine, 2,4,6-tris(trichloromethyl)- and 1H-purine, 2,6,8-tris(methylsulfonyl)-. The color scheme is green for the hydrophobic surface, violet for the hydrophilic surface and white for neutral.

3.5 Conclusions

Robust logPow models for predicting the octanol-water partition coefficient for a very large set of compounds were developed, based on 336 or 252 new surface-bin descriptors generated entirely from the binning of the local properties and their cross-products for the AM1, PM3, MNDO, and PM6, or the AM1* and MNDO/d Hamiltonians, respectively. The descriptors used to capture the chemical information were calculated either with the iso42 or the SES41. The analyses show that all the models developed are surface-dependent, with the SES41 always giving better predictions than the iso42. This “surface” factor mostly governs the approachability of the solvent and the extent of interaction with the solvent. The models obtained from the new binned SIM approach presented here predict better than the models obtained from the polynomial SIM approach. These models therefore seem well suited to many QSAR/QSPR applications, such as the prediction of protein binding, receptor affinity, or pharmacological activity of compounds. The FN plays a key role in predictions with the AM1* and MNDO/d Hamiltonians and seems to be the descriptor best able to compensate for the lack of the POL, which is not implemented in ParaSurf for these Hamiltonians. Compared to the other Hamiltonians, the electrostatic potential-derived atomic charges calculated with AM1 agree better. This is probably the reason AM1 predicts logPow better than the other Hamiltonians. The best performance was obtained with the AM1 Hamiltonian and the SES41, with an R2 of 0.89.

Generally, the partition coefficient describes the distribution, which can be disproportionate, of a molecule immersed in a solvent composed of two phases, one aqueous and the other organic. If the organic phase is octanol, the partition coefficient reflects the hydrophobicity of the molecule. When a non-polar molecule is in contact with the aqueous phase of the solvent, a repulsive force is created between the molecule and the aqueous solvent. This interaction describes the hydrophobic effect or hydrophobic hydration2-5. In medicinal chemistry, the study of this hydrophobic effect is of paramount importance for the development of drugs, because it predicts the possible interactions between nonpolar regions of drugs and their receiving environments.

Through this research project, a new approach that can help in the relative estimation of lipophilicity, which is necessary for QSAR studies, has been developed. The technique can be regarded as a purely thermodynamic method with a sufficiently broad scope that includes, among others, all neutral molecules whose atomic parameters are implemented in VAMP43. Thus, in contrast to fragment and molecular property-based approaches, it is now possible to estimate logPow values for some compounds without resorting to their experimental values.


3.6 References

1. Jäger, R.; Schmidt, F.; Schilling, B.; Brickmann, J. Localization and quantification of hydrophobicity: The molecular free energy density (MolFESD) concept and its application to sweetness recognition. Journal of Computer-Aided Molecular Design 2000, 14, 631-646.

2. Blokzijl, W.; Engberts, J. B. F. N. Hydrophobe Effekte - Ansichten und Tatsachen. Angew. Chem. Int. Ed. Engl. 1993, 105, 1610-1648.

3. Tanford, C. The Hydrophobic Effect: Formation of Micelles and Biological Membranes; Wiley: New York, 1973.

4. Creighton, T. E. Proteins: Structures and Molecular Properties; Freeman: New York, 1993.

5. Abraham, D. J.; Kellogg, G. E. 3D QSAR in Drug Design: Theory, Methods and Applications; Kubinyi, H., Ed.; Escom: Leiden, the Netherlands, 1993; p 506.

6. Hansch, C.; Leo, A. J. Substituent Constants for Correlation Analysis in Chemistry and Biology; Wiley: New York, 1979.

7. Martin, Y. C. Quantitative Drug Design: A Critical Introduction; Marcel Dekker, Inc.: San Diego, 1978.

8. Hansch, C.; Leo, A. Exploring QSAR: Fundamentals and Applications in Chemistry and Biology; American Chemical Society: Washington, DC, 1995.

9. Lyman, W. J.; Reehl, W. F.; Rosenblatt, D. H. Handbook of Chemical Property Estimation Methods; American Chemical Society: Washington, DC, 1990.

10. Hansch, C.; Dunn, W. J., III. J. Pharm. Sci. 1972, 61, 1-19.

11. Essex, J. W.; Reynolds, C. A.; Graham, W. R. Theoretical Determination of Partition Coefficients. J. Am. Chem. Soc. 1992, 114, 3634-3639.

12. Hansch, C.; Leo, A.; Mekapati, S. B.; Kurup, A. QSAR and ADME. Bioorg. Med. Chem. 2004, 12, 3391-3400.

13. Leo, A. Calculating log Poct from Structure. Chem. Rev. 1993, 30, 1283-1306.

14. Ghose, A. K.; Crippen, G. M. Atomic Physicochemical Parameters for Three-Dimensional Structure-Directed Quantitative Structure-Activity Relationships I. Partition Coefficients as a Measure of Hydrophobicity. J. Comput. Chem. 1986, 7, 565-577.

15. Klopman, G.; Li, J.-Y.; Wang, S.; Dimayuga, M. Computer Automated log P Calculations Based on an Extended Group Contribution Approach. J. Chem. Inf. Comput. Sci. 1994, 34, 752-781 and references therein.

16. Rekker, R. F.; Mannhold, R. Calculation of Drug Lipophilicity; VCH: New York, 1992.

17. Klopman, G.; Iroff, L. Calculation of Partition Coefficients by the Charge Density Method. J. Comput. Chem. 1981, 2, 157-160.

18. Leo, A. J. ClogP; Daylight Chemical Information Systems: Irvine, CA, 1991.


19. Klopman, G.; Namboodiri, K.; Schochet, M. Simple Method of Computing the Partition Coefficient. J. Comput. Chem. 1985, 6, 28-38.

20. Bodor, N.; Gabanyi, Z.; Wong, C.-K. A New Method for the Estimation of Partition Coefficient. J. Am. Chem. Soc. 1989, 111, 3783-3786.

21. Waller, C. L. A Three-Dimensional Technique for the Calculation of Octanol-Water Partition Coefficient. Quant. Struct.-Act. Relat. 1994, 13, 172-176.

22. Pixner, P.; Heiden, W.; Merx, H.; Moeckel, G.; Moeller, A.; Brickmann, J. Empirical Method for the Quantification and Localization of Molecular Hydrophobicity. J. Chem. Inf. Comput. Sci. 1994, 34, 1309-1319.

23. Abraham, R. J.; Hudson, B. D.; Kermode, M. W.; Nines, J. R. A General Calculation of Molecular Solvation Energies. J. Chem. Soc. Faraday Trans. 1988, 84, 1911-1917.

24. Ehresmann, B.; de Groot, M. J.; Alex, A.; Clark, T. New Molecular Descriptors Based on Local Properties at the Molecular Surface and a Boiling-Point Model Derived from Them. J. Chem. Inf. Comput. Sci. 2004, 43, 658-668.

25. Ehresmann, B.; de Groot, M. J.; Clark, T. Surface-Integral QSPR Models: Local Energy Properties. J. Chem. Inf. Model. 2005, 45, 1053-1060.

26. Sangster, J. Octanol-Water Partition Coefficients: Fundamentals and Physical Chemistry; John Wiley & Sons Ltd: Chichester, 1997; Vol. 2.

27. (a) Leo, A.; Hansch, C.; Elkins, D. Partition Coefficients and their uses. Chem. Rev. 1971, 71 (6), 525-616. (b) Leahy, D. E.; Taylor, P. J.; Wait, A. R. Model Solvent Systems for QSAR Part I. Propylene Glycol Dipelargonate (PGDP). A new Standard Solvent for use in Partition Coefficient Determination. Quant. Struct.-Act. Relat. 1989, 8 (1), 17-31.

28. Schulte, J.; Dürr, J.; Ritter, S.; Hauthal, W. H.; Quitzsch, K.; Maurer, G. Partition Coefficients for Environmentally Important, Multifunctional Organic Compounds in Hexane + Water. J. Chem. Eng. Data 1998, 43 (1), 69-73.

29. Bingham, R. C.; Dewar, M. J. S.; Lo, D. H. Ground States of Molecules. XXV. MINDO/3. An Improved Version of the MINDO Semiempirical SCF-MO Method. J. Am. Chem. Soc. 1975, 97, 1285-1293.

30. Dearden, J. C.; Bresnen, G. M. The Measurement of Partition Coefficients. QSAR Comb. Sci. 2006, 7 (3), 133-144.

31. Valko, K. Application of high-performance liquid chromatography based measurements of lipophilicity to model biological distribution. J. Chromatogr. A 2004, 1037 (1-2), 299-310.

32. Takacs-Novak, K.; Avdeef, A. Interlaboratory study of log P determination by shake-flask and potentiometric methods. J. Pharm. Biomed. Anal. 1996, 14 (11), 1405-1413.

33. The Physical Properties Database (PHYSPROP); Syracuse Research Corporation.

34. CrossFire Beilstein; Elsevier: Frankfurt, 2009; Vol. 7.1.


35. Sangster, J. LOGKOW - A databank of evaluated octanol-water partition coefficients (Log P); Sangster Research Laboratories: Montreal, Quebec, accessed 11/23/2008.

36. Nys, G. G.; Rekker, R. F. The concept of hydrophobic fragmental constants (f-values). II. Extension of its applicability to the calculation of lipophilicities of aromatic and hetero-aromatic structures. Chim. Ther. 1974, 9, 361-374.

37. Hughes, L. D.; Palmer, D. S.; Nigsch, F.; Mitchell, J. B. Why are some properties more difficult to predict than others? A study of QSPR models of solubility, melting point, and Log P. J. Chem. Inf. Model. 2008, 48 (1), 220-232.

38. Liu, R.; Zhou, D. Using Molecular Fingerprint as Descriptors in the QSPR study of Lipophilicity. J. Chem. Inf. Model. 2008, 48 (3), 542-549.

39. (a) Breindl, A.; Beck, B.; Clark, T.; Glen, R. C. Prediction of the n-octanol/Water Partition Coefficient, logP, Using a Combination of Semiempirical MO-Calculations and a Neural Network. J. Mol. Model. 1997, 3, 142-155. (b) Tetko, I. V.; Tanchuk, V. Y.; Villa, A. E. Prediction of n-octanol/water partition coefficients from PHYSPROP database using artificial neural networks and E-state indices. J. Chem. Inf. Comput. Sci. 2001, 41, 1407-1421.

40. Kramer, C.; Beck, B.; Clark, T. A Surface-Integral Model for Log Pow. J. Chem. Inf. Model. 2010, 50 (3), 429-436.

41. Pan, Q.; Tai, X.-C. Model the Solvent-Excluded Surface of 3D Protein Molecular Structures Using Geometric PDE-Based Level-Set Method. Commun. Comput. Phys. 2009, 6, 777-792.

42. Meyer, A. Y. The size of molecules. Chem. Soc. Rev. 1985, 15, 449-475.

43. Clark, T.; Alex, A.; Beck, A.; Burkhardt, F.; Chandrasekhar, J.; Gedeck, P.; Horn, A. H. C.; Hutter, M.; Martin, B.; Rauhut, G.; Sauer, W.; Schindler, T.; Steinke, T. VAMP 8.2; Accelrys Inc.: Erlangen; San Diego, USA, 2002.

44. CORINA 3.4; Molecular Networks Inc: Erlangen, Germany, 2006.

45. Labute, P. Molecular Operating Environment, 2008.10; Chemical Computing Group: Montreal, Quebec, Canada, 2008.

46. Hansch, C.; Leo, A.; Hoekman, D. Exploring QSAR: Hydrophobic, Electronic, and Steric Constants; The American Chemical Society: Washington, DC, 1995.

47. ParaSurf10; CEPOS InSilico Ltd.: Erlangen, Germany, 2010.

48. Clark, T.; Byler, K. G.; de Groot, M. J. Biological Communication via Molecular Surfaces. In Molecular Interactions Bringing Chemistry to life; Proceedings of the International Beilstein Workshop, Bozen, Italy, May 15-19, 2006; Logos Verlag: Berlin, 2008; pp 129-146.

49. Polikar, R. Ensemble based systems in decision making. IEEE Circ. Sys. Mag. 2006, 03/06, 21-45.

50. Efroymson, M. A. Multiple regression analysis. In Mathematical Methods for Digital Computers; Ralston, A.; Wilf, H. S., Eds.; Wiley: New York, 1960; Vol. 1, pp 191-203.


51. Kramer, C.; Tautermann, C. S.; Livingstone, D. J.; Salt, D. W.; Whitley, D. C.; Beck, B.; Clark, T. Sharpening the Toolbox of Computational Chemistry: A New Approximation of Critical F-Values for Multiple Linear Regression. J. Chem. Inf. Model. 2009, 49 (1), 28-34.

52. Campbell, N. A.; Brad, W.; Robin, J. H. Biology: Exploring Life; Pearson Prentice Hall: Boston, Massachusetts, 2006.

53. www.vcclab.org


Chapter 4

Comparative Study of Two Classification Algorithms for the Prediction of Drug-Induced Phospholipidosis


4.1 Introduction

In daily life, some people experience itching of the eyes or skin after taking a drug, which in most cases is the manifestation of an allergy to the drug. Likewise, the presence of some cationic amphiphilic drugs in the human body can cause a side effect commonly referred to as phospholipidosis1 (PPL). It manifests itself physically as a markedly extensive accumulation of phospholipids in lamellar and concentric forms within cells of the body. In the cellular environment, many processes regulate the life cycle of the entire cell, and this metabolism can undergo significant changes under the influence of phospholipids or enzymes. Thus, before any chemical can be assumed to have the properties required for proper biological activity, it should undergo important intermediate steps of evaluation, which include a considerable number of tests, such as pharmacodynamics, toxicity, pharmacokinetics, metabolism, excretion, and mutagenicity. Nevertheless, the design of effective drugs capable of producing good biological activity remains a major challenge.

Our goal in this work is the generation of new models for predicting drugs that induce PPL, using two machine learning (ML) techniques.

PPL, as shown in Figure 4.1 below1, arises from the crowding of phospholipids inside living cells, followed by the generation of concentric lamellar bodies, also called inclusion bodies or lysosomal myeloid bodies2.

Figure 4.1. Lamellar inclusion bodies as depicted by electron microscopy of cells. Adapted from 1.

The organs most often affected by this pathological manifestation are the lungs, liver, eyes, kidneys, cornea, and the nervous and lymphatic systems. The onset of PPL is largely reversible. It is related to the amount of drug released in the body3: it occurs only if a sufficiently high dosage is administered and disappears soon after systematic metabolism.

The beginning of drug discovery was strongly influenced by the high-speed physico-chemical method used for determining PPL. This assay was essentially based on a quantitative analysis of the drug-phospholipid complex formed4. Because screening for PPL is fundamentally based on the principles of structure-activity relationships (SAR), its speed may increase considerably if computational filters are used. Strengthened by this hypothesis, the use of SAR models facilitates systematic screening of virtual libraries of compounds that exist only in the computer5,6. To ensure that the molecules have the properties required to achieve the goal, the virtual screening of these libraries must use a fairly reliable and accurate method. Such a method can be established by the use of quantitative structure-activity relationships (QSAR).

The ability to develop effective rules that accurately predict the pharmacokinetic properties of drugs remains a major concern and an important asset for the design and development of drugs. The application of these rules not only opens a rational pathway to drug discovery7 but can also, to some extent, lead to a systematic decrease in project failures associated with pharmacokinetic problems8,9. With this aim in mind, the search for comprehensive and reliable approaches to optimize the pharmacokinetic properties of drug compounds8 has grown considerably. Many research studies have therefore been carried out on PPL using different methods and techniques10,17. Here, we predict PPL induction as a function of the octanol/water partition coefficient (logPow), the molecular descriptors calculated with ParaSurf, the van der Waals surface, and two ML techniques, Naive Bayes (NB) and Random Forest (RF).

Drug-induced PPL is characterized by a sporadic occurrence of phospholipids in intracellular lysosomes. It has not yet been proven experimentally that a close relationship exists between in vitro drug-induced PPL and the drug’s side effects in humans. However, this does not rule out the hypothesis that drug-induced PPL creates a favorable environment for the emergence of toxicity. Thanks to the scientific and technical progress of recent decades, drug-induced PPL can be detected effectively and efficiently with electron microscopy and quantitative PCR. In parallel, drug-induced PPL can be detected in HepG2 cells18 with a high-throughput LipidTox assay and a fluorescent lipophilic dye. The fluorescent probe technique is a rather practical test, which can be carried out within a short period of time on a relatively small sample of compounds.

SAR is the approach most commonly used in computational chemistry for the determination of drug-induced PPL. Through it, chemical compounds are classified as active or inactive according to whether or not they induce effects via a biological receptor. The Hansch model19,20 is the fundamental basis of, and an exceptional reference for, modern SAR and QSAR models. It expresses a physico-chemical property qualitatively and quantitatively through a linear statistical correlation with steric, electronic, and hydrophobic indices of chemical structures21. To extend the Hansch model, SAR and QSAR models have made use of new classes of structural descriptors or sufficiently powerful statistical models. From a qualitative point of view, a numerical descriptor can be likened to a digital representation of some essential molecular features, such as empirical indices (Hammett and Taft substituent constants), physical properties (logPow, dipole moment, or aqueous solubility), the number of substructures or substituents, graph descriptors22-24, topological indices25-27, connectivity indices28,29, electrotopological indices30,31, geometrical descriptors (molecular surface and volume), quantum indices (atomic charges, HOMO and LUMO energies)32,33, and molecular fields (steric, electrostatic, and hydrophobic)34.

4.2 Methods

We used a data set of 144 compounds listed in Table A9 of the Appendix, which when assayed were positive for PPL induction as determined by transmission electron microscopy35, provided by Anne Tilloy-Ellul (Pfizer Global R&D, Amboise Laboratories, France) and Marcel de Groot (Pfizer Global R&D, Sandwich Laboratories, UK). The Pfizer set of 144 canonical SMILES36 was converted to 3D structures with CORINA37,38. Geometries were optimized in the gas phase using the AM1, AM1*, MNDO, MNDO/d, PM3, or PM6 Hamiltonian in VAMP 1139. For AM1* and MNDO, diclofenac was not optimized because Na is not parameterized in VAMP for these Hamiltonians. The 124 molecular descriptors listed in Table A10 of the Appendix were calculated with ParaSurf10alpha40, using the default isodensity surface41 (iso) or the solvent-excluded surface42 (SES). The logPow values of these compounds were calculated using the binned SIM models developed in the preceding chapter and added to ParaSurf's40 standard descriptors, creating a set of 125 descriptors. Ceftazidime (which was duplicated), cephaloridine, and paraquat, which contain quaternary nitrogen, were removed because they cannot be neutralized and their predicted logPow values deviate strongly from those stored in PubChem. Figures 4.2-4.4 show these compounds with their logPow values predicted using the AM1 Hamiltonian and the SES42.

Figure 4.2. Ceftazidime predicted logPow: -12.72, XlogP3 (PubChem): 0.4.

Figure 4.3. Paraquat predicted logPow: -37.64, XlogP3-AA (PubChem): 1.7.


Figure 4.4. Cephaloridine predicted logPow: -13.52, XlogP3 (PubChem): 1.9.

One of each of the duplicated entries for carbon tetrachloride and valproic acid was removed. The class label +1 is assigned to active compounds (PPL inducers) and the class label -1 to inactive compounds (non-inducers). The data set of 138 compounds, each described by the set of 125 descriptors that captures our chemical information, was randomized and then divided into two sets: 50% for the training set and the remainder to assess the generalization performance. The training set of 69 compounds thus consists of 44 positives and 25 negatives, and the test set of 69 compounds consists of 37 positives and 32 negatives. All the models obtained from the training set were evaluated on their respective test (or validation) set, using a 10-fold cross-validation.
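The randomization and 50/50 division described above can be sketched as follows. This is only an illustration under stated assumptions: the helper names `split_half` and `k_fold_indices`, the seed, and the use of integer identifiers in place of the real compounds are all hypothetical, and the actual partition used in this work was produced differently and is not reproduced here.

```python
import random


def split_half(compound_ids, seed=0):
    """Shuffle compound identifiers and split them into two halves."""
    ids = list(compound_ids)
    rng = random.Random(seed)  # the seed is a hypothetical choice
    rng.shuffle(ids)
    half = len(ids) // 2
    return ids[:half], ids[half:]


def k_fold_indices(n_items, k=10):
    """Yield (train_idx, held_out_idx) pairs for k-fold cross-validation."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(n_items) if i not in held]
        yield train, held_out


# 138 placeholder identifiers stand in for the curated compounds.
train_ids, test_ids = split_half(range(138))
print(len(train_ids), len(test_ids))
```

A purely random split of this kind does not control the class composition of the two halves, which is consistent with the unequal positive/negative counts reported for the training and test sets.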

4.2.1 Machine Learning Algorithms

Today, predicting the biological activity of drugs has been facilitated by the further development of software that is easy to access and use. Thus, in molecular modeling, models can be developed within a short period of time by extracting rules and functions from large data sets with common methods and algorithms such as ML. ML is a fairly extensive area of artificial intelligence15, which includes, among others, decision trees, k-nearest neighbors, lazy learning, Bayesian methods, Gaussian processes, artificial neural networks, artificial immune systems, support vector machines and kernel algorithms. ML algorithms use computational and statistical methods to predict new properties by systematically extracting information from experimental data. All SAR models based on ML algorithms were generated using the ML software weka (http://www.cs.waikato.ac.nz/ml/weka/)43,44.

4.2.1.1 Naive Bayes

The basic principle of the Bayes classifier is based on Bayes' theorem, formulated by the British mathematician Thomas Bayes (1702-1761). Any classification made using the NB classifier is therefore a probabilistic classification with strong (naive) assumptions of independence, constructed from the conditional model45

\[ p(C \mid F_1, \ldots, F_n) \]

over a dependent class variable $C$ with a relatively small number of outcomes or classes, conditional on several feature variables $F_1$ through $F_n$. Using Bayes' theorem, the model can be written as

\[ p(C \mid F_1, \ldots, F_n) = \frac{p(C)\, p(F_1, \ldots, F_n \mid C)}{p(F_1, \ldots, F_n)}. \tag{4.1} \]

Under the naive independence assumptions, the conditional distribution over the class variable $C$ becomes

\[ p(C \mid F_1, \ldots, F_n) = \frac{1}{Z}\, p(C) \prod_{i=1}^{n} p(F_i \mid C). \tag{4.2} \]

Here, $Z$ is a scaling factor dependent only on $F_1, \ldots, F_n$. For this model, the associated classifier is the classify function given below:

\[ \operatorname{classify}(f_1, \ldots, f_n) = \underset{c}{\operatorname{argmax}}\; P(C = c) \prod_{i=1}^{n} p(F_i = f_i \mid C = c). \tag{4.3} \]
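A minimal sketch of the classify function in Eq. 4.3 is given below. It assumes Gaussian class-conditional densities for the numeric features, which is an assumption made for this illustration rather than a statement about the settings used in this work. The product is evaluated as a sum of logarithms to avoid numerical underflow, and the scaling factor Z is omitted because it does not affect the argmax. The function names and the toy data are hypothetical.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y, var_floor=1e-9):
    """Estimate P(C=c) and per-feature Gaussian parameters for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        variances = [
            max(sum((v - m) ** 2 for v in col) / n, var_floor)
            for col, m in zip(zip(*rows), means)
        ]
        model[label] = (n / len(X), means, variances)
    return model


def classify(model, features):
    """Eq. 4.3 in log space: argmax_c [log P(C=c) + sum_i log p(F_i=f_i | C=c)]."""
    def log_score(label):
        prior, means, variances = model[label]
        score = math.log(prior)
        for f, m, v in zip(features, means, variances):
            score += -0.5 * math.log(2 * math.pi * v) - (f - m) ** 2 / (2 * v)
        return score
    return max(model, key=log_score)


# Toy data invented for illustration only (two descriptors per compound).
X = [[1.0, 5.0], [1.2, 4.8], [0.8, 5.2], [3.0, 1.0], [3.2, 0.9], [2.8, 1.1]]
y = [-1, -1, -1, +1, +1, +1]
model = fit_gaussian_nb(X, y)
print(classify(model, [1.1, 5.0]), classify(model, [3.1, 1.0]))
```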

4.2.1.2 Random Forest

Suggested by Breiman46, RF is a classification method that uses a collection of unpruned trees to determine the output class of a given observation. It is a collection of tree predictors formulated as

\[ h(x; \Theta_k), \quad k = 1, \ldots, K, \]

where $x$ is the observed input (covariate) vector of length $p$ with associated random vector $X$, and the $\Theta_k$ are independent identically distributed (iid) random vectors.

In order to optimize the performance of this learning technique, Breiman introduced the notion of the margin function for a set of classifiers $\{h_1(x), h_2(x), \ldots, h_K(x)\}$ as follows:

\[ mg(X, Y) = \operatorname{av}_k I(h_k(X) = Y) - \max_{j \neq Y} \operatorname{av}_k I(h_k(X) = j), \tag{4.4} \]

in which $\operatorname{av}_k$ is the average over the classifiers and $I$ is the indicator function. For any classification carried out, the sign of the margin determines the outcome: if $mg(X, Y) > 0$, the set of classifiers votes for the correct classification; if $mg(X, Y) < 0$, it votes for an incorrect classification.
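The margin of Eq. 4.4 can be evaluated for a single observation from the votes of the individual trees, as in the sketch below. The function name `margin` and the ten example votes are hypothetical and serve only to illustrate the definition.

```python
from collections import Counter


def margin(votes, true_label):
    """Eq. 4.4 for one observation.

    Fraction of trees voting for the true class minus the largest
    fraction of trees voting for any other class.
    """
    counts = Counter(votes)
    n = len(votes)
    correct = counts.get(true_label, 0) / n
    wrong = max(
        (c / n for label, c in counts.items() if label != true_label),
        default=0.0,
    )
    return correct - wrong


# Ten hypothetical tree votes: seven for class +1, three for class -1.
votes = [+1] * 7 + [-1] * 3
print(margin(votes, true_label=+1))  # positive: the forest votes correctly
print(margin(votes, true_label=-1))  # negative: the forest votes incorrectly
```

For a two-class problem such as PPL induction, the margin reduces to the difference between the vote fractions of the two classes.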

The NB and the RF algorithms are fully implemented in weka43,44.

4.3 Results

PPL, as seen in Figure 4.547, is an anomaly caused by a lysosomal overload, characterized by the successive deposit of layers of phospholipids in tissues producing lamellar and concentric bodies.

Figure 4.5. (A) Lipid-filled laminated bodies. Transmission electron micrograph of lysosomal lamellar bodies (LLB) of PPL in kidney tissue. (B) Crystallin. Transmission electron micrograph of LLB of PPL in lung tissue. (C) Zebra. Transmission electron micrograph of LLB of PPL in soft tissue. Reprinted from 47.


4.3.1 Machine Learning Models

In order to optimize the biological activity, the selectivity, or the physico-chemical properties of molecular compounds, the use of SAR and QSAR models based on ML algorithms, over time, seems to be of paramount importance for the development of drugs.

The results obtained by applying the two ML classifiers to our database are given in the following sections. Among these results, those obtained from a set of descriptors calculated with the SES42 for the test set are listed in the form of confusion matrices, where the values (-1) and (+1) are assigned to non-induction and induction of PPL, respectively.

4.3.1.1 Naive Bayes Models

NB is a Bayesian classifier that is particularly well known for its use in anti-spam filters. It is generally well suited to the classification of objects into binary categories.

Implementing the NB classifier on the set of descriptors obtained from the different Hamiltonians and surfaces gives the results listed below, starting with the one obtained with AM1 and the SES:

Scheme: weka.classifiers.bayes.NaiveBayes
Correctly Classified Instances      54    78.2609 %
Incorrectly Classified Instances    15    21.7391 %

=== Confusion Matrix ===

  a  b   ← classified as
 21 11 | a = -1
  4 33 | b = 1

Figure 4.6. Confusion matrix for the test set obtained from the NB classification model using the descriptors calculated with AM1 and the SES.

Running the NB algorithm on the training set yields a model that, evaluated on the respective test set, gives a prediction accuracy of 66% for negatives and 89% for positives. The overall accuracy is 78%. Of the 32 negative compounds, 21 are correctly classified, and of the 37 positive compounds, 33 are correctly classified. Another important aspect is the difference of ≈ 23% between the accuracies of the positive and negative compounds. This difference is quite high, so the two class accuracies are poorly matched and there is no good similarity in the confusion matrix, although the overall accuracy is higher than 75%.
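The per-class and overall accuracies quoted above follow directly from the four entries of the confusion matrix, reading the rows as the actual class and the columns as the predicted class. The short sketch below makes the arithmetic explicit; the function name `class_accuracies` is introduced here for illustration only.

```python
def class_accuracies(tn, fp, fn, tp):
    """Return (negative accuracy, positive accuracy, overall accuracy)."""
    neg = tn / (tn + fp)   # correctly classified non-inducers
    pos = tp / (tp + fn)   # correctly classified inducers
    overall = (tn + tp) / (tn + fp + fn + tp)
    return neg, pos, overall


# Entries of the AM1/SES confusion matrix in Figure 4.6.
neg, pos, overall = class_accuracies(tn=21, fp=11, fn=4, tp=33)
print(f"negatives {neg:.1%}, positives {pos:.1%}, overall {overall:.1%}")
print(f"difference between class accuracies {abs(pos - neg):.1%}")
```

The difference of ≈ 23% given in the text is taken between the rounded class accuracies (89% and 66%); the unrounded difference is slightly larger.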

The performances of all models generated by running a NB classifier on a set of descriptors calculated with AM1 are given in Table 4.1.

Table 4.1. Performances of the training and test set models generated by the NB classifier on a set of descriptors obtained with AM1

| Surface | Set | True positive | True negative | False positive | False negative | Accuracy (%) |
|---|---|---|---|---|---|---|
| Iso | Training set | 35 | 14 | 11 | 9 | 71 |
| Iso | Test set | 30 | 20 | 12 | 7 | 72 |
| SES | Training set | 33 | 13 | 12 | 11 | 67 |
| SES | Test set | 33 | 21 | 11 | 4 | 78 |

When the models generated with the AM1 Hamiltonian are applied to the test sets, the overall accuracy obtained is 78% for the SES and 72% for the iso. There is an increase of the accuracy by ≈ 6% with the SES, compared to the iso. For the training set there is a decrease of the accuracy by ≈ 4% when replacing the iso with the SES.

The set of descriptors calculated with AM1* and the SES, classified with the NB algorithm, gives the following model:

Scheme: weka.classifiers.bayes.NaiveBayes
Correctly Classified Instances      50    72.4638 %
Incorrectly Classified Instances    19    27.5362 %

=== Confusion Matrix ===

  a  b   ← classified as
 20 12 | a = -1
  7 30 | b = 1

Figure 4.7. Confusion matrix obtained by applying the model generated by training the NB on the descriptors of the training set generated with AM1* and the SES to the respective test set.

Training the NB algorithm on the training set gives a model that, when used to classify the respective test set, yields a prediction accuracy of 63% for negatives and 81% for positives. The overall accuracy is 72%; 20 of the 32 negative compounds and 30 of the 37 positive compounds are correctly classified. The resulting difference of ≈ 18% between the accuracies of the positive and negative compounds is somewhat high, so only a slight similarity is manifested in the confusion matrix.

Table 4.2 gives the performance statistics of the models obtained when the set of descriptors calculated with AM1* is classified with NB.

Table 4.2. Performances of the training and test set models generated by the NB classifier on a set of descriptors obtained with AM1*

| Surface | Set | True positive | True negative | False positive | False negative | Accuracy (%) |
|---|---|---|---|---|---|---|
| Iso | Training set | 35 | 12 | 12 | 9 | 69 |
| Iso | Test set | 28 | 18 | 14 | 9 | 67 |
| SES | Training set | 34 | 15 | 9 | 10 | 72 |
| SES | Test set | 30 | 20 | 12 | 7 | 72 |

Modeling with the AM1* Hamiltonian predicts the test set with an overall accuracy of 72% with the SES and 67% with the iso. Compared to the iso, the use of the SES increases the overall accuracy by ≈ 5%. Here, we observe that for the SES the same overall accuracy is obtained for both the training and test sets; their performances differ only in the predictivity of the negative and positive compounds. There is a difference of ≈ 3% between the predictive powers of the training set models generated with the iso and the SES, with the SES increasing the performance of the training set for AM1*. The main observation here is that all the models generated with the NB for the descriptors calculated with AM1* are surface-dependent.

Chemical information attained through the descriptors obtained from MNDO and the SES was used with the NB classifier to generate the model in Figure 4.8.

Scheme: weka.classifiers.bayes.NaiveBayes
Correctly Classified Instances      54    78.2609 %
Incorrectly Classified Instances    15    21.7391 %

=== Confusion Matrix ===

  a  b   ← classified as
 21 11 | a = -1
  4 33 | b = 1

Figure 4.8. Representation of the confusion matrix of the test set obtained by training the NB on the chemical information generated with MNDO and the SES.

A NB trained on the SES fitted properties yields a model where the prediction accuracy of the negative compounds is 66%, and that of the positive compounds is 89%, with an overall accuracy of 78%. Here, the correct classifications are 21 of 32 and 33 of 37 for the negative and the positive compounds, respectively. This leads to a high difference of ≈ 23% between the accuracies of the negative and positive compounds, which explains the poor similarity in the above confusion matrix despite the good general performance.

The predictive powers of all models obtained by performing a NB classification on a set containing chemical information obtained with MNDO are given in Table 4.3.

Table 4.3. Performances of the training and test set models generated by the NB classifier on a set of descriptors obtained with MNDO

| Surface | Set | True positive | True negative | False positive | False negative | Accuracy (%) |
|---|---|---|---|---|---|---|
| Iso | Training set | 35 | 14 | 10 | 9 | 72 |
| Iso | Test set | 32 | 19 | 13 | 5 | 74 |
| SES | Training set | 35 | 15 | 9 | 9 | 74 |
| SES | Test set | 33 | 21 | 11 | 4 | 78 |

Considering the test set models, it appears that the overall accuracy is 78% for the SES and 74% for the iso. There is an increase in the accuracy by ≈ 4% with the SES, compared to the iso. For the training set, there is also an increase in the overall accuracy by ≈ 2% when comparing the model generated with the SES to the one obtained with the iso. As with the AM1* Hamiltonian, the NB classifier provides some models that are surface-dependent for the MNDO Hamiltonian.

The descriptors of the training and test sets obtained from MNDO/d and the ParaSurf calculations, subjected to a NB classification, allow the creation of a model with the statistical performances below:

Scheme: weka.classifiers.bayes.NaiveBayes
Correctly Classified Instances      50    72.4638 %
Incorrectly Classified Instances    19    27.5362 %

=== Confusion Matrix ===

  a  b   ← classified as
 23  9 | a = -1
 10 27 | b = 1

Figure 4.9. Representation of the confusion matrix for the descriptors of the test set generated with MNDO/d and the SES subjected to the NB classification.

The NB using the descriptors generated with the MNDO/d Hamiltonian and the SES provides prediction accuracies of 72% and 73% for the negative and positive compounds, respectively, of the test set. Twenty-three of 32 negative compounds are correctly predicted, and 27 of 37 positive compounds, resulting in an overall accuracy of 72%. Although the overall accuracy is below 75%, the difference of ≈ 1% between the accuracies of the negative and positive compounds is relatively small, which justifies the presence of this good similarity in the confusion matrix.

The statistical details about the performances of all models generated by running a NB classification on a set of descriptors obtained with MNDO/d are given in Table 4.4.

Table 4.4. Performances of the training and test set models generated by the NB classifier on a set of descriptors obtained with MNDO/d

| Surface | Set | True positive | True negative | False positive | False negative | Accuracy (%) |
|---|---|---|---|---|---|---|
| Iso | Training set | 39 | 13 | 12 | 5 | 75 |
| Iso | Test set | 31 | 17 | 15 | 6 | 70 |
| SES | Training set | 32 | 13 | 12 | 12 | 65 |
| SES | Test set | 27 | 23 | 9 | 10 | 72 |

Performing the NB algorithm, the descriptors of the test set generated with the SES yield an overall accuracy of 72%, and 70% for those obtained with the iso. There is an increase in the accuracy by ≈ 2% when replacing the iso with the SES. In contrast, for the training set there is a decrease of ≈ 10% when the iso is replaced with the SES. For the MNDO/d Hamiltonian, the SES helps increase the accuracy of the test set, but decreases the predictive power of the training set, in comparison to the iso.

With the NB, the training set obtained with the PM3 Hamiltonian and the SES gives a model that, applied to the test set, yields the statistical performances below:

Scheme: weka.classifiers.bayes.NaiveBayes
Correctly Classified Instances      46    66.6667 %
Incorrectly Classified Instances    23    33.3333 %

=== Confusion Matrix ===

  a  b   ← classified as
 16 16 | a = -1
  7 30 | b = 1

Figure 4.10. Representation of the predictive powers obtained with the NB trained on descriptors calculated with PM3 and the SES for negative and positive compounds of the test set in the form of a confusion matrix.

The set of 125 descriptors obtained is classified by the NB to generate a model that correctly predicts 16 of the 32 negative compounds, giving an accuracy of 50%, and 30 of the 37 positive ones, giving an accuracy of 81%. The overall accuracy is 67%, and the difference between the predictive performances of the negative and positive compounds is ≈ 31%, which is very high. The poor overall performance is consistent with this low or absent similarity in the confusion matrix.

Table 4.5 below describes statistically all the models generated by performing a NB classification on a set of descriptors obtained with PM3.

Table 4.5. Performances of the training and test set models generated by the NB classifier on a set of descriptors obtained with PM3

| Surface | Set | True positive | True negative | False positive | False negative | Accuracy (%) |
|---|---|---|---|---|---|---|
| Iso | Training set | 41 | 10 | 15 | 3 | 74 |
| Iso | Test set | 35 | 7 | 25 | 2 | 61 |
| SES | Training set | 36 | 13 | 12 | 8 | 71 |
| SES | Test set | 30 | 16 | 16 | 7 | 67 |

Running the models obtained through a NB classifier on the test set yields overall accuracies of 67% for the SES and 61% for the iso. There is an increase in the accuracy by ≈ 6% with the SES, compared to the iso. In contrast, for the training set, the SES decreases the predictivity by ≈ 3% compared to the iso.

Performing the NB probabilistic classification on a set of descriptors created with PM6 and the SES leads to the model described below:

Scheme: weka.classifiers.bayes.NaiveBayes
Correctly Classified Instances      50    72.4638 %
Incorrectly Classified Instances    19    27.5362 %

=== Confusion Matrix ===

  a  b   ← classified as
 20 12 | a = -1
  7 30 | b = 1

Figure 4.11. The confusion matrix obtained by performing the NB classification on the set of descriptors obtained with PM6 and the SES for the test set.

The NB algorithm applied to the descriptors of the training set yields a model that gives accuracies of 63% and 81% for the negative and positive compounds, respectively, when used to classify the compounds previously selected as the test set. This model, which correctly classifies 20 of the 32 negative and 30 of the 37 positive compounds, gives an overall accuracy of 72%, which is below 75%. The difference between the accuracies of the negative and positive predictions is ≈ 18%, implying an absence of similarity in the confusion matrix.

Table 4.6 lists all statistical values for the models realized by applying a NB classifier on a set of descriptors obtained with PM6.

Table 4.6. Performances of the training and test set models generated by the NB classifier on a set of descriptors obtained with PM6

| Surface | Set | True positive | True negative | False positive | False negative | Accuracy (%) |
|---|---|---|---|---|---|---|
| Iso | Training set | 38 | 11 | 14 | 6 | 71 |
| Iso | Test set | 28 | 20 | 12 | 9 | 70 |
| SES | Training set | 38 | 13 | 12 | 6 | 74 |
| SES | Test set | 30 | 20 | 12 | 7 | 72 |

The models generated from a set of descriptors calculated with PM6 give overall accuracies of 72% and 70% for the SES and the iso, respectively, when evaluated on their respective test sets. This, therefore, leads to an increase in the accuracy by ≈ 2% when changing from an iso to a SES. A similar situation is obtained for the training set with an increase in the accuracy by ≈ 3%. All the models performed with the NB, when the descriptors are calculated with the PM6, are surface-dependent, and the use of the SES helps increase the predictive power of the models, compared to the iso.
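The surface dependence of the NB test-set results can be summarized in one pass over the counts reported in Tables 4.1-4.6. The sketch below recomputes the accuracies from the tabulated true/false positives and negatives and prints the SES minus iso difference for each Hamiltonian; the dictionary layout and function names are choices made for this illustration.

```python
# Test-set counts (TP, TN, FP, FN) transcribed from Tables 4.1-4.6.
NB_TEST = {
    "AM1":    {"iso": (30, 20, 12, 7), "SES": (33, 21, 11, 4)},
    "AM1*":   {"iso": (28, 18, 14, 9), "SES": (30, 20, 12, 7)},
    "MNDO":   {"iso": (32, 19, 13, 5), "SES": (33, 21, 11, 4)},
    "MNDO/d": {"iso": (31, 17, 15, 6), "SES": (27, 23, 9, 10)},
    "PM3":    {"iso": (35, 7, 25, 2),  "SES": (30, 16, 16, 7)},
    "PM6":    {"iso": (28, 20, 12, 9), "SES": (30, 20, 12, 7)},
}


def accuracy(tp, tn, fp, fn):
    """Overall accuracy in percent."""
    return 100.0 * (tp + tn) / (tp + tn + fp + fn)


def surface_gain(results):
    """SES accuracy minus iso accuracy, in percentage points."""
    return {
        hamiltonian: accuracy(*counts["SES"]) - accuracy(*counts["iso"])
        for hamiltonian, counts in results.items()
    }


gains = surface_gain(NB_TEST)
for hamiltonian, gain in gains.items():
    print(f"{hamiltonian:7s} SES - iso = {gain:+.1f} percentage points")
```

Because the differences quoted in the text are taken between rounded accuracies, the unrounded values printed here can deviate from them by up to about one percentage point.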

In order to check the effect of another classifier on our data, the RF algorithm was performed on the different sets of descriptors previously used, and the results obtained are presented below.

4.3.1.2 Random Forest Models

RF is a classification technique that consists of a methodical combination of tree predictors obtained successively. Each tree is parameterized by the values of a random vector sampled independently and with the same distribution for all trees in the forest46. It is an ensemble method.

The models for the PPL classifications generated by a 10-fold cross-validation run through a RF classifier on the descriptors of the training and test sets calculated with the different Hamiltonians and surfaces are given below, starting with the one obtained with AM1 and the SES:

Scheme: weka.classifiers.trees.RandomForest -I 10 -K 0 -S 1
Correctly Classified Instances      52    75.3623 %
Incorrectly Classified Instances    17    24.6377 %

=== Confusion Matrix ===

  a  b   ← classified as
 23  9 | a = -1
  8 29 | b = 1

Figure 4.12. Confusion matrix for the test set obtained using RF, the AM1 Hamiltonian and the SES.


RF trained on the set of descriptors gives accuracies of 72% for the negative compounds and 78% for the positive ones. The difference between the two predictive powers is ≈ 6%, which is fairly small and implies a good similarity in the confusion matrix. For this model, in which 23 of the 32 negative compounds and 29 of the 37 positive ones are correctly classified, an overall accuracy of 75% is obtained. Compared to NB, there is a marked decrease in the difference between the predictive powers for negative and positive compounds, from 23% to 6%, leading to a good improvement in the similarity of the confusion matrix.

The performances of the models generated when a RF classifier is run on a set of descriptors calculated with AM1 are given in Table 4.7.

Table 4.7. Performances of the training and test set models generated by the RF classifier on a set of descriptors obtained with AM1

| Surface | Set | True positive | True negative | False positive | False negative | Accuracy (%) |
|---|---|---|---|---|---|---|
| Iso | Training set | 32 | 14 | 11 | 12 | 67 |
| Iso | Test set | 32 | 18 | 14 | 5 | 72 |
| SES | Training set | 33 | 16 | 9 | 11 | 71 |
| SES | Test set | 29 | 23 | 9 | 8 | 75 |

With the AM1 Hamiltonian, performing the RF approach on a set of descriptors calculated with the SES leads to an increase in the accuracies of the training and test sets by ≈ 4% and ≈ 3%, respectively, compared with the calculations performed with the iso. The RF classifier thus provides models that are entirely surface-dependent.

A RF approach trained on a set of descriptors calculated with AM1* and the SES generates the model below:

Scheme: weka.classifiers.trees.RandomForest -I 10 -K 0 -S 1
Correctly Classified Instances      54    78.2609 %
Incorrectly Classified Instances    15    21.7391 %

=== Confusion Matrix ===

  a  b   ← classified as
 26  6 | a = -1
  9 28 | b = 1

Figure 4.13. Representation of the confusion matrix obtained when the model generated by training a RF classifier on the descriptors calculated with AM1* and the SES is applied to the test set.

An overall accuracy of 78% is obtained for the test set when performing the RF algorithm on the selected training set. Twenty-six of the 32 negative compounds and 28 of the 37 positive ones are correctly predicted, leading to performances of 81% and 76% for the negative and positive compounds, respectively. Here, there is a gap of ≈ 5% between the predictivities of the negative and positive compounds and a very good similarity in the confusion matrix, which is probably related to this small difference between the predictive powers of the negative and positive compounds. For the same set of descriptors, RF helps improve the predictive power and correct problems with the similarity of the confusion matrix observed with NB. The difference between the predictivities of negative and positive compounds changes from 18% to 5%.

Table 4.8 presents the performance statistics of the training and test set models obtained by running the RF algorithm on a set of descriptors calculated with AM1*.

Table 4.8. Performances of the training and test set models generated by the RF classifier on a set of descriptors obtained with AM1*

| Surface | Set | True positive | True negative | False positive | False negative | Accuracy (%) |
|---|---|---|---|---|---|---|
| Iso | Training set | 31 | 12 | 12 | 13 | 63 |
| Iso | Test set | 29 | 17 | 15 | 8 | 67 |
| SES | Training set | 34 | 14 | 10 | 10 | 71 |
| SES | Test set | 28 | 26 | 6 | 9 | 78 |

The overall accuracy is ≈ 78% for the SES and ≈ 67% for the iso when using the training model obtained with the RF classifier on the descriptors of the test set for AM1*. This yields an increase in the accuracy by ≈ 11% with the SES, compared to the iso. The same observation is made for the training set models, where the use of the SES leads to an increase in the accuracy by ≈ 8%; therefore, all the models generated here are surface-dependent.

Mapping the data obtained from MNDO and the SES calculations through a RF classification results in a model whose characteristics are given below:

Scheme: weka.classifiers.trees.RandomForest -I 10 -K 0 -S 1
Correctly Classified Instances      52    75.3623 %
Incorrectly Classified Instances    17    24.6377 %

=== Confusion Matrix ===

  a  b   ← classified as
 25  7 | a = -1
 10 27 | b = 1

Figure 4.14. Confusion matrix representing the statistical distribution of the predictivity of the test set obtained by training the RF classifier on descriptors calculated with MNDO and the SES.

This model obtained with the RF algorithm is characterized by an overall accuracy of 75%, with 78% for the negative compounds and 73% for the positive ones. A small difference of ≈ 5% is obtained between the accuracies of positive and negative compounds, which fully justifies the existence of this good similarity in the confusion matrix. Twenty-five of the 32 negative compounds and 27 of the 37 positive ones are correctly classified. The difference between the accuracies of the positive and negative compounds is reduced from 23% to 5% when the NB is replaced by the RF classifier.

The predictive powers of the training and test set models obtained by performing a RF classification on a set containing chemical information gained from MNDO calculations are listed in Table 4.9.

Table 4.9. Performances of the training and test set models generated by the RF classifier on a set of descriptors obtained with MNDO

| Surface | Set | True positive | True negative | False positive | False negative | Accuracy (%) |
|---|---|---|---|---|---|---|
| Iso | Training set | 34 | 14 | 10 | 10 | 71 |
| Iso | Test set | 30 | 20 | 12 | 7 | 72 |
| SES | Training set | 36 | 16 | 8 | 8 | 76 |
| SES | Test set | 27 | 25 | 7 | 10 | 75 |

The overall accuracy for the test set is 75% when generating the descriptors with the SES and 72% for the iso. There is an increase in the accuracy by ≈ 3% with the SES, compared to the iso. Using the descriptors of the training set obtained with the SES for model generation helps in increasing the accuracy by ≈ 5%. The training and test set models are both surface-dependent.

Combining the different tree predictors derived from an application of the RF algorithm to a set of descriptors obtained with MNDO/d and the SES gives the model presented below:

Scheme: weka.classifiers.trees.RandomForest -I 10 -K 0 -S 1
Correctly Classified Instances      55    79.7101 %
Incorrectly Classified Instances    14    20.2899 %

=== Confusion Matrix ===

  a  b   ← classified as
 25  7 | a = -1
  7 30 | b = 1

Figure 4.15. Statistical distribution of the predictivity of negative and positive compounds of the test set obtained by training the RF classifier on descriptors calculated with MNDO/d and the SES in a confusion matrix.

A set of descriptors obtained from the MNDO/d Hamiltonian and the SES, classified with a RF algorithm, gives a model with an overall accuracy of 80% and accuracies of 78% and 81% for the negative and positive compounds, respectively. A small gap of ≈ 3% exists between the predictive powers of the negative and positive compounds. Twenty-five of the 32 negative compounds and 30 of the 37 positive compounds are correctly predicted. There is a very good similarity in the confusion matrix, which is justified by the high overall accuracy and the high predictive powers obtained for both negative and positive compounds. With NB there is a small difference of about 1% between the predictivities of positive and negative compounds, but an overall accuracy of only 72%; RF significantly improves the overall performance while maintaining a low difference of about 3% between the predictivities of positive and negative compounds.

The statistical information on the performances of the models for the training and test sets generated by training a RF classifier on a set obtained with MNDO/d is presented in Table 4.10.

Table 4.10. Performances of the training and test set models generated by the RF classifier on a set of descriptors obtained with MNDO/d

| Surface | Set | True positive | True negative | False positive | False negative | Accuracy (%) |
|---|---|---|---|---|---|---|
| Iso | Training set | 33 | 11 | 14 | 11 | 64 |
| Iso | Test set | 28 | 23 | 9 | 9 | 74 |
| SES | Training set | 29 | 13 | 12 | 15 | 61 |
| SES | Test set | 30 | 25 | 7 | 7 | 80 |

The overall accuracies are 80% for the SES and 74% for the iso when the models obtained with MNDO/d are evaluated by a RF classification of their respective test sets. There is an increase in the accuracy by ≈ 6% when the iso is replaced by the SES for the calculation of the descriptors for the test set. In contrast, there is a decrease of ≈ 3% for the training set when the model obtained with the SES is compared to the one generated with the iso.

With a RF classifier, parameterizing all trees by a random vector stemming from the chemical information attained with PM3 and the SES generates the following model:

Scheme: weka.classifiers.trees.RandomForest -I 10 -K 0 -S 1 Correctly Classified Instances 58 84.0580 % Incorrectly Classified Instances 11 15.9420 %

=== Confusion Matrix ===

a b ← classified as 25 7 | a = -1 4 33 | b = 1

Figure 4.16. Configuration of the confusion matrix for the test set obtained with RF for a set

of descriptors generated with the PM3 Hamiltonian and the SES.

Mapping the local properties onto the SES for each compound yields a set of descriptors. This set of descriptors, together with the additional logPow, trained through an RF classification, gives a model in which seven compounds are wrongly predicted as positive and four are wrongly predicted as negative. This is the best model, with an overall accuracy of 84% and accuracies of 78% and 89% for the negative and positive compounds, respectively. A very good similarity exists in the confusion matrix, where the difference between the accuracies of negative and positive compounds is ≈ 11%. Compared to the model obtained with the NB classifier, RF significantly improves the overall performance and the similarity of the confusion matrix, with a large reduction in the difference between the predictivities of positive and negative compounds, down from 31% to 11%.
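
As a minimal sketch of how the overall and class-wise accuracies quoted above follow from a confusion matrix, the Python example below recomputes them from the four cell counts of Figure 4.16. The function name class_accuracies and its argument names are illustrative assumptions; they are not part of Weka or of the workflow used in this work.

```python
def class_accuracies(tn, fp, fn, tp):
    """Return (overall, negative-class, positive-class) accuracy in percent.

    tn, fp: negative compounds classified as negative / positive
    fn, tp: positive compounds classified as negative / positive
    """
    total = tn + fp + fn + tp
    overall = 100.0 * (tn + tp) / total
    negative = 100.0 * tn / (tn + fp)
    positive = 100.0 * tp / (tp + fn)
    return overall, negative, positive


# Figure 4.16 (PM3, SES, test set): row a = -1 is (25, 7), row b = 1 is (4, 33)
overall, negative, positive = class_accuracies(tn=25, fp=7, fn=4, tp=33)
print(f"overall {overall:.0f}%, negative {negative:.0f}%, positive {positive:.0f}%")
```

The difference between the last two values is the quantity referred to in the text as the similarity of the confusion matrix.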

Table 4.11 describes statistically the training and test set models generated by performing a RF classification technique on a set obtained with PM3.

Table 4.11. Performances of the training and test set models generated by the RF classifier on a set of descriptors obtained with PM3

Model  Set           True positive  True negative  False positive  False negative  Accuracy (%)
Iso    Training set  31             13             12              13              64
Iso    Test set      30             23             9               7               77
SES    Training set  35             15             10              9               72
SES    Test set      33             25             7               4               84

When the generated models are tested, the overall accuracy is 84% for the SES and 77% for the iso, an increase of ≈ 7% with the SES compared to the iso. For the training set, the SES gives an accuracy ≈ 8% higher than the iso. Performing RF classification when the geometry optimization is carried out with the PM3 Hamiltonian therefore yields models that are surface-dependent.

Using the collection of unpruned trees obtained by running a RF approach on the information extracted from our data set with PM6 and the SES yields a model whose performances are described below:

Scheme: weka.classifiers.trees.RandomForest -I 10 -K 0 -S 1
Correctly Classified Instances      53      76.8116 %
Incorrectly Classified Instances    16      23.1884 %

=== Confusion Matrix ===

  a  b   ← classified as
 21 11 |  a = -1
  5 32 |  b = 1

Figure 4.17. RF predictions for the test set via a confusion matrix when the descriptors are calculated with PM6 and the SES.

With a 10-fold cross-validation scheme, run through RF, a classification model for drugs inducing PPL is obtained in which 21 compounds are correctly classified as non-inducers of PPL and 32 are correctly classified as inducers. For this model, the accuracies are 66% for the negative compounds, 86% for the positive ones, and 77% overall. There is a very large difference of ≈ 20% between the accuracies of the negative and positive compounds, leading to a weak similarity in the confusion matrix. PM6 appears to be an exception among the Hamiltonians. Compared to NB, RF helps in increasing the overall accuracy. However, unlike the previous observations, there is an increase in the difference between the predictivities of the positive and negative compounds, which rises from 18% to 20%.

Table 4.12 presents the statistical values for the training and test set models obtained by applying an RF algorithm to a set of descriptors obtained with PM6.

Table 4.12. Performances of the training and test set models generated by the RF classifier on a set of descriptors obtained with PM6

Model  Set           True positive  True negative  False positive  False negative  Accuracy (%)
Iso    Training set  32             15             10              12              68
Iso    Test set      34             19             13              3               77
SES    Training set  37             12             13              7               71
SES    Test set      32             21             11              5               77

This is a particular situation where the overall accuracy is 77% for both the SES and iso when the models are evaluated on their respective test sets. For the training set, there is an increase of ≈ 3% with the SES, compared to the iso. Performing the RF classification on a set of descriptors generated with PM6 yields models in which the predictive powers are unchanged when applied to their respective test sets using either the SES or the iso. The major difference between the two models generated based on the descriptors calculated with the PM6 Hamiltonian, the SES and the iso is their predictive powers for negative and positive compounds.

Figure 4.18 below gives a general description of the predictive powers of the models obtained by training a NB or RF classifier on the descriptors of the training set calculated with the different Hamiltonians and surfaces when evaluated on their respective test sets.

Figure 4.18. Distribution of the accuracies of the models trained with NB and RF and evaluated on the test sets with the different Hamiltonians and surfaces. (I) and (S) represent iso and the SES, respectively.

The histogram shows that with the NB algorithm there is a continuous increase in the accuracy when the models are generated with the descriptors calculated with the SES. The same situation is observed for the RF, except for the models generated with the PM6 Hamiltonian where the predictive power is the same when the descriptors used are calculated either with the SES or the iso. For the AM1* Hamiltonian, the use of descriptors calculated with the iso produces models that have the same predictive power when evaluated on their test sets by both the NB and the RF algorithms.

The Matthews Correlation Coefficient (MCC)48 is an important parameter necessary in evaluating the predictive power of a model. It is calculated using the standard formula

\[
\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} . \tag{4.5}
\]

Here, TP is the number of compounds correctly classified as inducers of PPL, and TN the number of compounds correctly classified as non-inducers of PPL. FP and FN are related to the number of compounds wrongly classified as inducers and non-inducers of PPL, respectively. In order to measure the predictivity of each classifier, the MCC value of the test set for each model generated with the NB and the RF classifiers was calculated and the values obtained are reported in Table 4.13.
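
As a worked illustration of Equation (4.5), the Python sketch below evaluates the MCC from the four confusion-matrix counts. The function name mcc is an illustrative assumption, as is the convention of returning zero when one of the marginal sums vanishes; the counts are the RF test-set values given in Tables 4.11 and 4.12.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient as written in Equation (4.5)."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: the MCC is set to 0.0 if a marginal sum is zero.
    return numerator / denominator if denominator else 0.0


# RF test-set counts taken from Tables 4.11 (PM3) and 4.12 (PM6)
pm3_ses = mcc(tp=33, tn=25, fp=7, fn=4)
pm6_iso = mcc(tp=34, tn=19, fp=13, fn=3)
pm6_ses = mcc(tp=32, tn=21, fp=11, fn=5)
print(round(pm3_ses, 2), round(pm6_iso, 2), round(pm6_ses, 2))
```

The two PM6 models have the same number of correct classifications (53 of 69), yet their MCC values differ slightly because the errors are distributed differently between the two classes.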

Table 4.13. Calculated values of the MCC for the test set of each model obtained with a 10-fold cross-validation

Hamiltonian and surface  NB    RF    Hamiltonian and surface  NB    RF
AM1 (iso)                0.45  0.45  AM1 (SES)                0.57  0.50
AM1* (iso)               0.33  0.33  AM1* (SES)               0.45  0.57
MNDO (iso)               0.48  0.45  MNDO (SES)               0.57  0.51
MNDO/d (iso)             0.39  0.48  MNDO/d (SES)             0.45  0.59
PM3 (iso)                0.24  0.53  PM3 (SES)                0.33  0.68
PM6 (iso)                0.39  0.55  PM6 (SES)                0.45  0.54
Average                  0.38  0.47  Average                  0.47  0.57

As seen in Table 4.13, the MCC values for each model generated with the RF and the NB algorithms increase with the SES for AM1, AM1*, MNDO, MNDO/d, and PM3. The only exception is the pair of models generated by performing the RF classification on sets of descriptors obtained with PM6 and either the iso or the SES. Although the two models have the same overall accuracy (77%), as mentioned above, the MCC value decreases by ≈ 0.01 with the SES. With the NB, however, the MCC value increases with the SES when the descriptors are calculated with PM6. The use of the SES improves the average of the MCC by ≈ 0.11 for the RF and ≈ 0.09 for the NB when compared to the iso. The average value of the MCC obtained with RF is greater than that calculated by performing the NB algorithm for both surfaces.

The predictions obtained from the best models are listed in Table A11 of the Appendix.

The effect of each of the classifiers (NB and RF) on the compounds of the test set was investigated by sorting in increasing order the number of correct classifications over 12 models. Here the number of times the majority prediction was made for each compound over all of the 12 runs is calculated and listed in Table 4.14 below. LogP (calc) are the values of the logP of the best model obtained with the PM3 Hamiltonian and the SES. LogP (XLogP3), H-bond donor, and H-bond acceptor values are obtained from the predicted properties stored in PubChem, ChemSpider (Predicted-ACD/Labs Properties) or SciFinder, denoted with the references a, b, and c, respectively.
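
The counting procedure described above can be sketched in a few lines of Python. The compound names, labels, and predictions below are purely hypothetical placeholders (three compounds and three runs instead of 69 compounds and 12 models), and the function names are illustrative assumptions rather than part of the software used in this work.

```python
from collections import Counter


def correct_counts(true_labels, runs):
    """For each compound, count the runs whose prediction equals the true label."""
    return {
        name: sum(1 for run in runs if run[name] == label)
        for name, label in true_labels.items()
    }


def class_distribution(counts):
    """Percentage of compounds in each prediction class C<k> (k correct runs)."""
    histogram = Counter(counts.values())
    total = len(counts)
    return {f"C{k}": 100.0 * n / total for k, n in sorted(histogram.items())}


# Hypothetical toy data: 1 = inducer of PPL, -1 = non-inducer
true_labels = {"cmpd_A": 1, "cmpd_B": -1, "cmpd_C": 1}
runs = [
    {"cmpd_A": 1, "cmpd_B": 1, "cmpd_C": 1},
    {"cmpd_A": 1, "cmpd_B": -1, "cmpd_C": -1},
    {"cmpd_A": 1, "cmpd_B": 1, "cmpd_C": 1},
]
counts = correct_counts(true_labels, runs)
distribution = class_distribution(counts)
print(counts)
print(distribution)
```

Applied to the 12 models of one classifier, the first function gives the NB or RF column of Table 4.14 and the second gives the class percentages plotted as histograms.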

Table 4.14. Number of correct classifications of 12 models by the NB and the RF classifiers for all the Hamiltonians and surfaces

Compound              logP    logP        H-bond  H-bond    Inductance  NB  RF
                      (calc)              donor   acceptor

Not predictive
Methadone             3.56    3.9a        0a      2a        -1          0   0
Clociguanil           2.92    3.081b,c    4b,c    6b,c      1           0
Doxapram              3.13    3.3a        0a      3a        -1          0
Procaine              2.27    2.256b,c    2b,c    4b,c      -1          0
Bicalutamide          3.55    4.135b,c    2b,c    6b,c      -1          0
Diflunisal            3.19    3.652b,c    2b,c    3b,c      -1          0
Suramin               4.34    1.5a        12a     23a       1               0

Very slightly predictive
Colchicine            2.08    1.066b,c    1b,c    7b,c      -1          1   1
Sulindac              3.12    3.4a        1a      5a        -1          1
Bilirubin             4.52    2.9a        6a      6a        1           1
Rolitetracycline      2.34    1.657c      6c      11c       -1          1
Tacrine               2.62    2.7a        1a      2a        -1          1
Gemfibrozil           3.24    3.8a        1a      3a        -1          2
Etoposide             0.16    0.275b,c    3b,c    13b,c     -1          2
Suramin               4.34    1.5a        12a     23a       1           2
Flutamide             2.99    3.3a        1a      6a        -1          2
Clociguanil           2.92    3.081b,c    4b,c    6b,c      1               1
Doxapram              3.13    3.3a        0a      3a        -1              1
Tunicamycin           -0.35   -0.3a       11a     16a       1               1
Bilirubin             4.52    2.9a        6a      6a        1               2
Abacavir              1.29    1.158b,c    4b,c    7b,c      -1              2

Slightly predictive
Amiodarone            7.34    7.6a        0a      4a        1           4
Ketoconazole          3.74    4.043b,c    0b,c    8b,c      1           4   4
Amikacin              -5.67   -5.262b,c   17b,c   18b,c     1               4
Tacrine               2.62    2.7a        1a      2a        -1              4
Carbamazepine         2.56    2.5a        1a      1a        -1              5
Rolitetracycline      2.34    1.657c      6c      11c       -1              5

Predictive
Amikacin              -5.67   -5.262b,c   17b,c   18b,c     1           6
Tunicamycin           -0.35   -0.3a       11a     16a       1           6
Chlorpromazine        5.26    5.2a        0a      3a        1           7
Carbamazepine         2.56    2.5a        1a      1a        -1          7
Abacavir              1.29    1.158b,c    4b,c    7b,c      -1          7
Valproic_acid         2.40    2.8a        1a      2a        -1          7
AC-3579               3.37    2.154b,c    0b,c    5b,c      1           8
Tocainide             1.80    0.808b,c    3b,c    3b,c      1               7
Procaine              2.27    2.256b,c    2b,c    4b,c      -1              7
Hydrazine             -0.90   -1.5a       2a      2a        -1              7
Demeclocycline        1.10    0.942c      7c      10c       -1              7
Chlortetracycline     2.18    1.323b,c    7b,c    10b,c     -1              8
Gemfibrozil           3.24    3.8a        1a      3a        -1              8
Tobramycin            -4.54   -4.224b,c   15b,c   14b,c     1               8
WY-14643              3.73    3.958b,c    2b,c    5b,c      -1              8
Doxycycline           1.88    1.777b,c    7b,c    10b,c     -1              8

Very predictive
Tobramycin            -4.54   -4.224b,c   15b,c   14b,c     1           9
Tocainide             1.80    0.808b,c    3b,c    3b,c      1           9
Etoposide             0.16    0.275b,c    3b,c    13b,c     -1              9
Amiodarone            7.34    7.6a        0a      4a        1               9
Sulindac              3.12    3.4a        1a      5a        -1              9
Famotidine            -0.65   -0.64b      8b      9b        -1              9
Acetaminophen         0.60    0.5a        2a      2a        -1              10
Carbon tetrachloride  2.53    2.8a        0a      0a        -1              10
Fenfluramine          4.55    3.554b,c    1b,c    1b,c      1               10
Promethazine          3.72    4.8a        0a      3a        1               10
Quinacrine            6.12    6a          1a      4a        1               10
Doxycycline           1.88    1.777b,c    7b,c    10b,c     -1          10
Chlortetracycline     2.18    1.323b,c    7b,c    10b,c     -1          10
Caffeine              0.48    -0.1a       0a      3a        -1          11
Dibucaine             4.60    4.759b,c    1b,c    5b,c      1           11  11
Galactosamine         -2.57   -2.8a       5a      6a        -1          11  11
Hypoglicin-A          0.60    -2.5a       2a      3a        -1          11  11
Methyldopa            0.64    0.676b      4b      5b        -1          11  11
Piroxicam             0.96    0.588c      2c      7c        -1          11  11
Zileuton              1.62    1.6a        2a      3a        -1          11  11
Acetaminophen         0.60    0.5a        2a      2a        -1          11
Demeclocycline        1.10    0.942c      7c      10c       -1          11
Hydrazine             -0.90   -1.5a       2a      2a        -1          11
Valproic_acid         2.40    2.8a        1a      2a        -1              11
Diflunisal            3.19    3.652b,c    2b,c    3b,c      -1              11
Bicalutamide          3.55    4.135b,c    2b,c    6b,c      -1              11
AC-3579               3.37    2.154b,c    0b,c    5b,c      1               11
Tetracaine            2.81    3.7a        1a      4a        1               11
Methotrexate          0.37    -0.446b,c   7b,c    13b,c     -1              11
Gentamicin            -2.72   -1.887b     11b     12b       1               11

Highly predictive
Chloroquine           4.79    4.6a        1a      3a        1           12  12
Cyclizine             4.33    3.6a        0a      2a        1           12  12
Desipramine           4.09    3.972b,c    1b,c    2b,c      1           12  12
Hydroxyzine           3.45    3.7a        1a      4a        1           12  12
Chlorcyclizine        5.11    4.5a        0a      2a        1           12  12
Clomipramine          5.16    5.2a        0a      2a        1           12  12
Emetine               4.49    4.7a        1a      6a        1           12  12
Lysergide             3.13    3a          1a      2a        1           12  12
AY-9944               6.08    6.395b,c    2b,c    2b,c      1           12  12
Norchlorcyclizine     4.08    3.4a        1a      2a        1           12  12
Nortriptyline         4.51    4.5a        1a      1a        1           12  12
Pheniramine           3.54    2.8a        0a      2a        1           12  12
Phentermine           2.76    2.200b,c    2b,c    1b,c      1           12  12
Tamoxifen             6.34    7.1a        0a      2a        1           12  12
Temozolomide          -0.14   -1.1a       1a      5a        -1          12  12
Homochlorcyclizine    5.58    4.2a        0a      2a        1           12  12
Quinidine             2.40    2.823b,c    1b,c    4b,c      1           12  12
SDZ-200125            3.45    3.351b,c    0b,c    5b,c      1           12  12
Stavudine             -0.19   -0.647b,c   2b,c    6b,c      -1          12  12
Trifluperazine        5.16    5a          0a      7a        1           12  12
Triparanol            6.98    6.2a        1a      3a        1           12  12
Netilmicin            -2.12   -1.840b,c   11b,c   12b,c     1           12  12
Methotrexate          0.37    -0.446b,c   7b,c    13b,c     -1          12
Tetracaine            2.81    3.7a        1a      4a        1           12
WY-14643              3.73    3.958b,c    2b,c    5b,c      -1          12
Carbon tetrachloride  2.53    2.8a        0a      0a        -1          12
Famotidine            -0.65   -0.64b      8b      9b        -1          12
Fenfluramine          4.55    3.554b,c    1b,c    1b,c      1           12
Promethazine          3.72    4.8a        0a      3a        1           12
Quinacrine            6.12    6a          1a      4a        1           12
Gentamicin            -2.72   -1.887b     11b     12b       1           12
Caffeine              0.48    -0.1a       0a      3a        -1              12
Chlorpromazine        5.26    5.2a        0a      3a        1               12
Flutamide             2.99    3.3a        1a      6a        -1              12

a PubChem; b ChemSpider (Predicted-ACD/Labs Properties); c SciFinder.

The number of correct predictions ranges between zero and 12, suggesting that over the 12 runs, both classes were predicted unequally.

Most of the compounds correctly predicted as actives by both NB and RF classifiers for the 12 models generated for each classifier have their physical properties parameterized as follows:

2 ≤ logP ≤ 5, 0 ≤ H-bond donors ≤ 2, 1 ≤ H-bond acceptors ≤ 7,

thus respecting Lipinski's condition49, which states that a drug able to produce effects in humans must have a logP value not greater than 5 (logP units), a number of hydrogen bond donors not greater than five, and a number of hydrogen bond acceptors not greater than 10.
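
The three criteria considered here can be checked with a short Python sketch. Only the three parameters investigated in this chapter are tested; the function name lipinski_violations is an illustrative assumption, and the example values are the database logP, H-bond donor, and H-bond acceptor entries listed for chloroquine and triparanol in Table 4.14.

```python
def lipinski_violations(logp, h_donors, h_acceptors):
    """List which of the three criteria considered in this chapter are violated."""
    violations = []
    if logp > 5:
        violations.append("logP")
    if h_donors > 5:
        violations.append("H-bond donors")
    if h_acceptors > 10:
        violations.append("H-bond acceptors")
    return violations


# Values from Table 4.14 (database logP, H-bond donors, H-bond acceptors)
print(lipinski_violations(4.6, 1, 3))  # chloroquine
print(lipinski_violations(6.2, 1, 3))  # triparanol
```

An empty list means that all three criteria are satisfied.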

Among these compounds correctly classified as inducers of PPL in the 12 runs performed with each classifier, triparanol, tamoxifen, and AY-9944 have computed logP values much higher than 5 (logP units), but their numbers of hydrogen bond donors and hydrogen bond acceptors are not greater than five and 10, respectively; they therefore violate only one of the three Lipinski rules investigated here.

A special observation was made for netilmicin, which was correctly predicted 12 times as an inducer of PPL by both the NB and the RF algorithms, with a logP < 0 and numbers of hydrogen bond donors and acceptors equal to 11 and 12, respectively.

Compounds correctly classified as inactive have physical properties that are characterized as follows:

logP < 0, 1 ≤ H-bond donors ≤ 2, 5 ≤ H-bond acceptors ≤ 6.

The statistical description of the number of correct predictions realized by the NB and the RF classifiers for the 69 compounds of the randomly chosen test set is shown in Figure 4.19.

Figure 4.19. Distribution of the percentage of correct classifications by the NB and the RF for all the compounds of the test set.

The histogram shows that NB has a higher percentage of compounds correctly predicted in all 12 runs (class C12), that is, NB predicts correctly more consistently, but also a higher percentage of compounds that are always misclassified (C0) and of compounds in classes C1, C2, and C6. For class C12, both classifiers have a percentage above 35%. The two algorithms give the same percentage, just over 5%, for class C7, but RF performs better than NB for C4, C5, C8, C9, C10, and C11. Here, NB and RF provide 11 prediction classes of unequal amplitudes.

4.4 Discussion

Through this investigation into predicting compounds that induce PPL, it appears that logP is one of the most important physical properties necessary for the prediction of the biological activity of compounds, as are the numbers of H-bond donors and H-bond acceptors. The compounds triparanol, tamoxifen, and AY-9944, with computed logP values of 6.2 (a), 7.1 (a), and 6.395 (b,c), respectively, were highly predicted by both the NB and RF classifiers but violated one criterion of Lipinski's rule of five49 among the three parameters (computed logP, H-bond donor, H-bond acceptor) chosen for our investigation. As seen in the previous chapter, the logP of compounds with a large number of rotatable bonds is difficult to predict. This may apply to these three compounds because their numbers of rotatable bonds are 10, eight, and eight (all from source a) for triparanol, tamoxifen, and AY-9944, respectively, and therefore they are very flexible compounds. This seems to be the case also for netilmicin, which, with eight rotatable bonds (source a), excludes all three of Lipinski's hypotheses analysed.

The accuracies of the training set models are between 61% and 76%, confirming the robustness of the models. The data set used was composed of inactive and active compounds in unequal numbers; therefore the classes are unbalanced. According to Lowe et al.1, accuracy alone is not sufficient to assess the prediction quality when the study is made with unbalanced classes. The determination of the MCC48 can be helpful in estimating the prediction quality. Theoretically, the MCC48 value lies between -1 and +1, where -1 corresponds to a perfect anticorrelation, zero to the equivalent of random guessing, and +1 to a perfect correlation. The calculated values of the MCC48 ranged from 0.24 to 0.68; a value close to zero would imply the presence of classes without prediction, which is related to an uninformative prediction. All the MCC48 values obtained here were above zero, implying that there is no anticorrelation and no uninformative prediction, but a fairly reliable correlation. Referring to the calculated values of the MCC48 listed in Table 4.13, the best average values of 0.57 and 0.47 were obtained for the models generated by applying RF and NB classifiers, respectively, to sets of descriptors calculated with the SES. There is indeed an improvement in the predictivity by ≈ 0.1 with the RF classification, compared to the NB. RF provides the best prediction quality when applied to descriptors generated with the SES, with an average MCC48 of 0.57, which is close to the values obtained by Lowe et al.1 (0.532 with their E-Dragon descriptors set and 0.539 for their combination of descriptors).

In Figure 4.19, the percentages higher than 35% obtained for the C12 class by the two classifiers suggest a certain reliability of NB and RF in predicting compounds that have an ability to induce PPL. There were 25 cases (≈ 36%) for RF and 31 cases (≈ 45%) for NB in which compounds were correctly predicted consistently in all 12 runs, and several cases where compounds were incorrectly predicted. The large differences in percentage observed between the different prediction classes lead to the hypothesis that certain molecules are particularly difficult to predict, and the compounds of classes C0, C1, C2, C4, and C5 are those that adhere to this hypothesis. The most consistently well-predicted compounds in our randomly chosen test set and the most misclassified compounds are listed in Table 4.14. Methadone, for example, was always misclassified by both the NB and RF algorithms, and doxapram was misclassified in at least 11 of the 12 runs by each of them.

Figures 4.20 and 4.21 give more details about the abilities of RF and NB in predicting active and inactive compounds, using the created set of 125 descriptors.

Figure 4.20. Histogram of active compounds correctly predicted by NB and RF classifiers.

For active compounds, RF predicted better than NB for C1, C10, and C11. In contrast, NB was better than RF for C6, C9, and C12, and the two were equal for C0, C2, C4, C7, and C8. There is a large gap between the percentage of C12 (more than 30%) and those of the other prediction classes, where the values range between 3% and 6%; this implies that the two classifiers predict active compounds more correctly and consistently. Here, there are ten classes of prediction for both RF and NB.

Figure 4.21. Histogram of inactive compounds correctly predicted by NB and RF classifiers.

RF predicted inactive compounds better than NB for C4, C5, C8, C9, and equally for C7, C10, and C11. In contrast, NB predicted better than RF for C0, C1, C2, and C12. Here, there is variability in the percentages of the different classes with dominance in the C11 class, followed by class C12. In contrast to Figure 4.20, there is not a very large gap between the percentages of the different classes. Another important point is that there are 11 classes of prediction for RF and only seven for NB.

At first glance, the major observation provided by Figures 4.19, 4.20, and 4.21 is the constant dominance of NB over RF for class C12. From this, it might seem that NB predicts better than RF, which contrasts with the information provided in Table 4.13. However, these figures have different classes of prediction in which there are some variances regarding the percentages of compounds correctly predicted. Therefore, the observed differences between the prediction percentages of NB and RF for class C12 would be fully offset by the other classes, which is consistent with the previously computed values for MCC48 and supports the hypothesis that the prediction quality of RF is better than that of NB. For active compounds, where the number of compounds (37) is higher than for inactive compounds (32), NB and RF both provide 10 prediction classes. In contrast, for inactive compounds, RF gives 11 classes and NB seven classes of prediction. This permits us to hypothesize that NB predicts in relation to the proportion of elements present in a class. The 69 compounds used in our test set are dissimilar. No particular correlation could be reported between the size of the data set (i.e., the number of chemicals) and the percentage of correctly predicted compounds. Therefore, we can say that the percentage of test set compounds predicted correctly can be directly connected to the diversity of chemicals tested. The use of dissimilar compounds produces a good test for the algorithm. In general, structures that have a positive assay result for PPL induction are over-represented in our test set; therefore, models trained on more active than inactive compounds will predict actives better than inactives.

The best predictivity (≈ 84%) was obtained by performing an RF classification on a set of descriptors computed with the PM3 Hamiltonian and the SES, and there is a very good similarity in the confusion matrix obtained:

  a  b   ← classified as
 25  7 |  a = -1
  4 33 |  b = 1

where only seven compounds are incorrectly predicted as actives and only four compounds are wrongly predicted as inactives.

Except for PM6, which with RF gives a constant value of 77% with either the iso or the SES, all the other models generated are surface-dependent for both NB and RF algorithms. This specific problem could be due to the atomic parameters of PM6. However, the MCC48 values obtained when using the RF classification on sets of descriptors generated with the iso and with the SES are 0.55 and 0.54, respectively. This difference of ≈ 0.01 is probably related to the difference between the two descriptor sets in predicting compounds as inducers or non-inducers of PPL, which leads to different predictivities for negative and positive compounds.

Models presenting good similarities in their confusion matrices are those in which there is a small difference between the predictive powers for compounds classified as inducers and as non-inducers of PPL. Most of these models are specifically generated by performing the RF classification on sets of descriptors calculated with the SES. It clearly appears that the use of ML algorithms with high quality data or descriptor sets can lead to improved predictions. According to Svetnik et al.50, compared to linear trees, RF performs significantly better and, among others, provides the most appropriate approach for similar classification problems (no need for descriptor or variable selection), therefore providing the specificity needed to select appropriate descriptors. Many research studies have been conducted on the prediction of PPL induction with different data sets. Could the use of a larger data set produce better predictivity, as mentioned by Lowe et al.1? In this work, we used a standard 10-fold cross-validation available in Weka. Increasing the number of cross-validations may have a significant effect on predictivity.
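
For readers unfamiliar with the scheme, the following Python sketch illustrates the partitioning behind a 10-fold cross-validation for a set of 69 compounds. It is a plain, non-stratified illustration written for this text and is not Weka's own implementation; the function name k_fold_indices and the seed value are assumptions.

```python
import random


def k_fold_indices(n_samples, k=10, seed=1):
    """Yield (train, test) index lists for a simple k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Every k-th shuffled index goes into the same fold
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(69, k=10))
for train, test in splits:
    print(len(train), len(test))
```

Each compound is used exactly once for testing and k - 1 times for training, so the reported statistics are accumulated over all k held-out folds.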

4.5 Conclusions

We presented a comparison of two ML techniques for the prediction of drug-induced PPL, based on the calculated ParaSurf descriptors and the logPow. Focusing on the accuracies and the calculated values of the MCC48, the performance values for the descriptors in our data set clearly support the hypothesis that ParaSurf descriptors and the logPow contain sufficient information for ML algorithms to construct accurate and simple predictive models to determine whether a given molecule is active or inactive. Although the RF algorithm does not require an extra step for the selection of appropriate descriptors, its application to descriptors calculated with the SES generally leads to robust models that, when evaluated on their respective test sets, yield not only good predictive powers, with overall accuracies generally higher than or equal to 75%, but also confusion matrices presenting good similarities. With RF, the use of the SES improves the predictive power of the models and corrects the similarity problem of the confusion matrix. RF produces the best models, with an average MCC48 of 0.57 in a 10-fold cross-validation. We obtained different accuracies for the different models, suggesting that some individual molecules might be hard to predict with some Hamiltonians. Furthermore, NB yields models that, when evaluated on their respective test sets, are totally surface-dependent and that predict a larger share of compounds correctly in all 12 runs than RF. Given its ability to generate entirely surface-dependent classification models, the NB classifier can be regarded as a van der Waals surface-dependent classifier. One of the singular points motivating the use of the Bayesian approach is that, due to the penalization of complex models51, overfitting and overtraining problems are considerably reduced. For a specific class (active or inactive), RF seems to be able to predict regardless of the distribution of elements within the class, unlike NB, which tends to predict relative to the proportion of items in the class.
Based on the fact that the set of descriptors created here provides sufficient chemical information necessary to generate sufficiently robust models, we can say with some conviction that this work provides a new approach for predicting drugs that induce PPL. The scientific input of the results presented is their possible use as an indicator to study the performance of classification methods in real systems relevant to drug discovery. From our comparative study between NB and RF algorithms, it appears that RF is the more appropriate classifier for some QSAR studies.

4.6 References

1. Lowe, R.; Glen, R. C.; Mitchell, J. B. O. Predicting Phospholipidosis Using Machine Learning. Mol. Pharmaceutics 2010, 7 (5), 1708-1714.
2. Anderson, N.; Borlak, J. Drug-induced phospholipidosis. FEBS Lett. 2006, 580, 5533-5540.
3. Abe, A.; Hiraoka, M.; Shayman, J. A. A role for lysosomal phospholipase A2 in drug induced phospholipidosis. Drug Metab. Lett. 2007, 1, 49-53.
4. Vitovic, P.; Alakoskela, J. M.; Kinnunen, P. K. J. Assessment of drug-lipid complex formation by a high-throughput Langmuir balance and correlation to phospholipidosis. J. Med. Chem. 2008, 51, 1842-1848.
5. Ohlstein, E. H.; Ruffolo, R. R.; Elliot, J. D. Drug discovery in the next millennium. Ann. Rev. Pharmacol. Toxicol. 2000, 40, 177-191.
6. Rodrigues, A. D. Preclinical drug metabolism in the age of high-throughput screening. Pharm. Res. 1997, 14, 1504-1510.
7. Navia, M. A.; Chaturvedi, P. R. Design principles for orally bioavailable drugs. Drug Disc. Today 1996, 1, 179-189.
8. Gaviraghi, G.; Barnaby, R. J.; Pellegatti, M. Pharmacokinetic challenges in lead optimization. In Pharmacokinetic Optimization in Drug Research, Testa, B.; van de Waterbeemd, H.; Folkers, G.; Guy, R., Eds. Wiley-VCH: Basel, Switzerland, 2001; pp 1-14.
9. Kennedy, T. Managing the drug discovery/development interface. Drug Disc. Today 1997, 2, 436-444.
10. Pelletier, D. J.; Gehlhaar, D.; Tilloy-Ellul, A.; Johnson, T. O.; Greene, N. Evaluation of a published in silico model and construction of a novel Bayesian model for predicting phospholipidosis inducing potential. J. Chem. Inf. Model. 2007, 47, 1196-1205.
11. Ploemen, J. P.; Kelder, J.; Hafmans, T.; van de Sandt, H.; van Burgsteden, J. A.; Saleminki, P. J.; van Esch, E. Use of physicochemical calculation of pKa and ClogP to predict phospholipidosis-inducing potential: A case study with structural related piperazines. Exp. Toxicol. Pathol. 2004, 55, 347-355.
12. Sawada, H.; Takami, K.; Asahi, S. A toxicogenomic approach to drug-induced phospholipidosis: Analysis of its induction mechanism and establishment of a novel in vitro screening system. Toxicol. Sci. 2005, 83, 282-292.
13. Atienzar, F.; Gerets, H.; Dufrane, S.; Tilmant, K.; Cornet, M.; Dhalluin, S.; Ruty, B.; Rose, G.; Canning, M. Determination of phospholipidosis potential based on gene expression analysis in HepG2 cells. Toxicol. Sci. 2007, 96, 101-114.
14. Kasahara, T.; Tomita, K.; Murano, H.; Harada, T.; Tsubakimoto, K.; Ogihara, T.; Ohnishi, S.; Kakinuma, C. Establishment of an in vitro high-throughput screening assay for detecting phospholipidosis-inducing potential. Toxicol. Sci. 2006, 90, 133-141.
15. Ivanciuc, O. Weka machine learning for predicting the phospholipidosis inducing potential. Curr. Top. Med. Chem. 2008, 8, 1691-1709.
16. Tomizawa, K.; Sugano, K.; Yamada, H.; Horii, I. Physicochemical and cell-based approach

for early screening of phospholipidosis-inducing potential. J. Toxicol. Sci. 2006, 31, 315-324.

17. Umesh, M. H.; Gottfried, W.; Regueiro-Ren, A.; Roumyana, Y.; John, P. C.; Stephen, P. A. Phospholipidosis as a function of Basicity, Lipophilicity, and Volume of Distribution of Compounds. Chem. Res. Toxicol. 2010, 23, 749-755.

18. Nioi, P.; Perry, B. K.; Wang, E. J.; Gu, Y. Z.; Snyder, R. D. In vitro detection of drug-induced phospholipidosis using gene expression and fluorescent phospholipid based methodologies. Toxicol. Sci. 2007, 99 (1), 162-173.

19. Hansch, C.; Fujita, T. ρ-σ-π analysis. A method for the correlation of biological activity and chemical structure. J. Am. Chem. Soc. 1964, 86, 1616-1626.

20. Fujita, T.; Iwasa, J.; Hansch, C. A new substituent constant, π, derived from partition coefficients. J. Am. Chem. Soc. 1964, 86, 5175-5180.

21. Bonchev, D.; Rouvray, D. H. Chemical Graph Theory. Introduction and Fundamentals. Abacus Press/Gordon & Breach Science Publishers: New York, 1991.

22. Trinajstic, N. Chemical Graph Theory. CRC Press: Boca Raton, FL, 1992.

23. Ivanciuc, O. Graph Theory in chemistry. In Handbook of Chemoinformatics, Gasteiger, J., Ed. Wiley-VCH: Weinheim, 2003; Vol. 1, pp 103-138.

24. Bonchev, D. Information Theoretic Indices for Characterization of Chemical Structure. Research Studies Press: Chichester, UK, 1983.

25. Balaban, A. T.; Ivanciuc, O. Historical development of topological indices. In Topological Indices and Related Descriptors in QSAR and QSPR, Devillers, J.; Balaban, A. T., Eds. Gordon and Breach Science Publishers: Amsterdam, 1999; pp 21-57.

26. Ivanciuc, O. Topological indices. In Handbook of Chemoinformatics, Gasteiger, J., Ed. Wiley-VCH: Weinheim, 2003; Vol. 3, pp 981-1003.

27. Kier, L. B.; Hall, L. H. Molecular Connectivity in Chemistry and Drug Research. Academic Press: New York, 1976.

28. Kier, L. B.; Hall, L. H. Molecular Connectivity in Structure-Activity Analysis. Research Studies Press: Letchworth, 1986.

29. Kier, L. B.; Hall, L. H. Molecular Structure Description. The Electrotopological State. Academic Press: San Diego, 1999.

30. Ivanciuc, O. Electrotopological state indices. In Molecular Drug Properties. Measurement and Prediction, Mannhold, R., Ed. Wiley-VCH: Weinheim, 2008; pp 85-109.

31. Todeschini, R.; Consonni, V. Descriptors from molecular geometry. In Handbook of Chemoinformatics, Gasteiger, J., Ed. Wiley-VCH: Weinheim, 2003; Vol. 3, pp 1004-1033.

32. Jurs, P. Quantitative structure-property relationships. In Handbook of Chemoinformatics, Gasteiger, J., Ed. Wiley-VCH: Weinheim, 2003; Vol. 3, pp 1314-1335.


33. Ivanciuc, O. 3D QSAR models. In QSPR/QSAR Studies by Molecular Descriptors, Diudea, M. V., Ed. Nova Science Publishers: Huntington, NY, 2001; pp 233-280.

34. Zhang, S.; Golbraikh, A.; Oloff, S.; Kohn, H.; Tropsha, A. A novel automated lazy learning QSAR (ALL-QSAR) approach: Method development, applications, and virtual screening of chemical databases using validated ALL-QSAR models. J. Chem. Inf. Model. 2006, 46, 1984-1995.

35. Coulombe, P. A.; Kan, F. W.; Bendayan, M. Introduction of a high-resolution cytochemical method for studying the distribution of phospholipidosis in biological tissues. European Journal of Cell Biology 1988, 46 (3), 564-576.

36. Weininger, D. SMILES, a Chemical Language and Information System. Introduction to Methodology and Encoding Rules. Journal of Chemical Information and Computer Sciences 1988, 28, 31-36.

37. CORINA 3D Structure Generator; Molecular Networks, GmbH: Erlangen, Germany, 2006.

38. Sadowski, J.; Gasteiger, J.; Klebe, G. Comparison of Automatic Three-Dimensional Model Builders using 639 X-Ray structures. Journal of Chemical Information and Computer Sciences 1994, 34, 1000-1008.

39. Clark, T.; Alex, A.; Beck, A.; Burkhardt, F.; Chandrasekhar, J.; Gedeck, P.; Horn, A. H. C.; Hutter, M.; Martin, B.; Rauhut, G.; Sauer, W.; Schindler, T.; Steinke, T. VAMP 8.2; Accelrys Inc.: Erlangen: San Diego, USA, 2002.

40. ParaSurf10, CEPOS InSilico Ltd.: Erlangen, Germany, 2010.

41. Meyer, A. Y. The size of molecules. Chem. Soc. Rev. 1985, 15, 449-475.

42. Pan, Q.; Tai, X.-C. Model the Solvent-Excluded Surface of 3D Protein Molecular Structures Using Geometric PDE-Based Level-Set Method. Commun. Comput. Phys. 2009, 6, 777-792.

43. Witten, I. H.; Frank, E. Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed. Morgan Kaufmann: San Francisco, 2005; p 525.

44. Reasor, M. J.; Kacew, S. Drug-induced phospholipidosis: Are there functional consequences? Exp. Biol. Med. 2001, 226, 825-830.

45. Frank, E.; Trigg, L.; Holmes, G.; Witten, I. H. Naïve Bayes for regression. Mach. Learn. 2000, 41 (1), 5-15.

46. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5-32.

47. Linda.; Chatman, A.; Daniel, M.; Theodore, O.; Johnson.; Susan D, A. A Strategy for Risk Management of Drug-Induced Phospholipidosis. Toxicol. Pathol. 2009, 37 (7).

48. Matthews, B. W. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta 1975, 405, 442-451.

49. Leo, A.; Hansch, C.; Elkins, D. Partition Coefficients and their uses. Chem. Rev. 1971, 71 (6), 525-616.


50. Svetnik, V.; Liaw, A.; Tong, C.; Culberson, J.; Sheridan, R. P.; Feuston, B. P. Random Forest: A Classification and Regression Tool for Compound Classification and QSAR Modeling. J. Chem. Inf. Comput. Sci. 2003, 43, 1947-1958.

51. Michael, J. S.; John, O. M.; Ross, A. Mc.; David, A. W.; Frank, R. B.; Paul, A. S. Comparison of Linear and Nonlinear Classification Algorithms for the Prediction of Drug and Chemical Metabolism by Human UDP-Glucuronosyltransferase Isoforms. J. Chem. Inf. Comput. Sci. 2003, 43, 2019-2024.


Appendix


Table A1. Component terms used in the SIM. The 150 terms form 25 groups of six consecutive terms, each group built from one base function $f(r)$. Within a group the terms are, in order: $f(r)$, $f(r)$ (the source prints the first two entries of every group identically), $[f(r)]^{3/2}$, $[f(r)]^{2}$, $[f(r)]^{5/2}$ and $[f(r)]^{3}$.

Numbers Base function $f(r)$
1-6 $V(r)$
7-12 $IE_L(r)$
13-18 $EA_L(r)$
19-24 $\alpha_L(r)$
25-30 $\eta_L(r)$
31-36 $V(r) \cdot IE_L(r)$
37-42 $V(r) \cdot EA_L(r)$
43-48 $V(r) \cdot \alpha_L(r)$
49-54 $V(r) \cdot \eta_L(r)$
55-60 $IE_L(r) \cdot EA_L(r)$
61-66 $IE_L(r) \cdot \alpha_L(r)$
67-72 $IE_L(r) \cdot \eta_L(r)$
73-78 $EA_L(r) \cdot \alpha_L(r)$
79-84 $EA_L(r) \cdot \eta_L(r)$
85-90 $\alpha_L(r) \cdot \eta_L(r)$
91-96 $V(r) \cdot IE_L(r) \cdot EA_L(r)$
97-102 $V(r) \cdot IE_L(r) \cdot \alpha_L(r)$
103-108 $V(r) \cdot IE_L(r) \cdot \eta_L(r)$
109-114 $V(r) \cdot EA_L(r) \cdot \alpha_L(r)$
115-120 $V(r) \cdot EA_L(r) \cdot \eta_L(r)$
121-126 $IE_L(r) \cdot EA_L(r) \cdot \alpha_L(r)$
127-132 $IE_L(r) \cdot EA_L(r) \cdot \eta_L(r)$
133-138 $IE_L(r) \cdot \alpha_L(r) \cdot \eta_L(r)$
139-144 $EA_L(r) \cdot \alpha_L(r) \cdot \eta_L(r)$
145-150 $V(r) \cdot EA_L(r) \cdot \alpha_L(r) \cdot \eta_L(r)$


Table A2. $\Delta G_{solv}(\mathrm{H_2O})$ (kcal mol$^{-1}$) for the entire data set calculated with AM1 and the iso

No Compound Exp Calc
1 Methane 2 0.8524
2 Ethane 1.83 0.981
3 Propane 1.96 1.259
4 Cyclopropane 0.75 0.955
5 2-Methylpropane 2.32 1.645
6 2,2-Dimethylpropane 2.5 2.079
7 n-Butane 2.08 1.521
8 2,2-Dimethylbutane 2.59 2.252
9 Cyclopentane 1.2 0.801
10 n-Pentane 2.33 1.747
11 2-Methylpentane 2.52 2.098
12 3-Methylpentane 2.51 2.059
13 2,4-Dimethylpentane 2.88 2.482
14 2,2,4-Trimethylpentane 2.85 2.854
15 Methylcyclopentane 1.6 1.334
16 n-Hexane 2.49 1.997
17 Cyclohexane 1.23 1.678
18 Methylcyclohexane 1.71 2.04
19 cis-1,2-Dimethylcyclohexane 1.58 2.305
20 n-Heptane 2.62 2.228
21 n-Octane 2.89 2.46
22 Ethylene 1.27 0.6656
23 Propylene 1.27 0.779
24 2-Methylpropene 1.16 0.823
25 1-Butene 1.38 0.986
26 2-Methyl-2-butene 1.31 0.966
27 3-Methyl-1-butene 1.83 1.416
28 1-Pentene 1.66 1.254
29 trans-2-Pentene 1.34 1.125
30 4-Methyl-1-pentene 1.91 1.732
31 Cyclopentene 0.56 0.1955
32 1-Hexene 1.68 1.506
33 Cyclohexene 0.37 0.648
34 trans-2-Heptene 1.66 1.62
35 1-Methylcyclohexene 0.67 0.756
36 1-Octene 2.17 1.971
37 1,3-Butadiene 0.61 0.252
38 2-Methyl-1,3-butadiene 0.68 0.402
39 2,3-Dimethyl-1,3-butadiene 0.4 0.505
40 1,4-Pentadiene 0.94 1.031
41 1,5-Hexadiene 1.01 1.153
42 Acetylene -0.01 0.283
43 Propyne -0.31 0.016
44 1-Butyne -0.16 0.298
45 1-Pentyne 0.01 0.567
46 1-Hexyne 0.29 0.765
47 1-Heptyne 0.6 0.977
48 1-Octyne 0.71 1.179
49 1-Nonyne 1.05 1.373
50 Butenyne 0.04 -0.296
51 Benzene -0.87 -1.27206
52 Toluene -0.89 -1.0833
53 1,2,4-Trimethylbenzene -0.86 -0.8755
54 Ethylbenzene -0.8 -0.5145
55 m-Xylene -0.84 -0.9824
56 o-Xylene -0.9 -1.0254
57 p-Xylene -0.8 -0.9412
58 Propylbenzene -0.53 -0.282
59 Butylbenzene -0.4 0.041
60 t-Butylbenzene -0.44 -0.043
61 t-Amylbenzene -0.18 0.152
62 Naphthalene -2.39 -2.8586
63 Anthracene -4.23 -4.313
64 Phenanthrene -3.95 -4.188
65 Acenaphthene -3.15 -3.1016
66 p-Chlorotoluene -1.92 -2.3758
67 Fluoromethane -0.22 -0.543
68 1,1-Difluoroethane -0.11 -0.232
69 Trifluoromethane -0.81 -0.076
70 Tetrafluoromethane 3.11 2.384
71 Hexafluoroethane 3.94 3.046
72 Octafluoropropane 4.28 5.498
73 Fluorobenzene -0.78 -1.9676
74 2-Chloro-1,1,1-trifluoroethane 0.05 -0.723
75 Chlorofluoromethane -0.77 -0.366
76 Chlorodifluoromethane -0.5 0.034
77 Chlorotrifluoromethane 2.52 0.866
78 Dichlorodifluoromethane 1.69 1.491
79 Fluorotrichloromethane 0.82 0.10197
80 1,1,2-Trichloro-1,2,2-trifluoroethane 1.77 -0.2529
81 1,1,2,2-Tetrachlorodifluoroethane 0.82 1.488
82 Chloropentafluoroethane 2.87 3.591
83 1,1-Dichlorotetrafluoroethane 2.51 2.085
84 1,2-Dichlorotetrafluoroethane 2.32 0.5602
85 1-Bromo-1-chloro-2,2,2-trifluoroethane -0.13 -1.5307
86 Bromotrifluoromethane 1.79 -2.442
87 1-Bromo-1,2,2,2-tetrafluoroethane 0.52 -1.0819
88 Chloromethane -0.56 -0.8307
89 Dichloromethane -1.41 -0.5013
90 Trichloromethane -1.07 -0.0448
91 Tetrachloromethane 0.1 -1.031
92 Chloroethane -0.63 -0.7603
93 1,1-Dichloroethane -0.85 -0.4106
94 (E)-1,2-Dichloroethane -1.73 -0.9465
95 1,1,1-Trichloroethane -0.25 -0.0542
96 1,1,2-Trichloroethane -1.95 -0.3544
97 1,1,1,2-Tetrachloroethane -1.15 0.653
98 1,1,2,2-Tetrachloroethane -2.36 1.392
99 Pentachloroethane -1.36 0.2649
100 Hexachloroethane -1.41 -0.4145
101 1-Chloropropane -0.27 -0.8321
102 2-Chloropropane -0.25 0.048
103 1,2-Dichloropropane -1.25 -0.489
104 1,3-Dichloropropane -1.9 -1.4707
105 1-Chlorobutane -0.14 -0.7407
106 2-Chlorobutane 0.07 0.394
107 1,1-Dichlorobutane -0.7 -0.4029
108 1-Chloropentane -0.07 -0.6889
109 2-Chloropentane 0.07 0.58
110 3-Chloropentane 0.04 0.467
111 Chloroethylene -0.59 -1.0348
112 cis-1,2-Dichloroethylene -1.17 -1.2622
113 trans-1,2-Dichloroethylene -0.76 -1.3745
114 Trichloroethylene -0.44 -1.3712
115 Tetrachloroethylene 0.05 -1.67
116 Chlorobenzene -1.12 -2.4015
117 o-Chlorotoluene -1.15 -1.47909
118 1,2-Dichlorobenzene -1.36 -3.163
119 1,3-Dichlorobenzene -0.98 -3.315
120 1,4-Dichlorobenzene -1.01 -3.829
121 2,2'-Dichlorobiphenyl -2.73 -1.8674
122 2,3-Dichlorobiphenyl -2.45 -3.2653
123 2,2,3'-Trichlorobiphenyl -1.99 -2.6484
124 Bromotrichloromethane -0.93 -0.5003
125 1-Chloro-2-bromoethane -1.95 -1.06448
126 Bromomethane -0.82 -1.4264
127 Dibromomethane -2.11 -3.307
128 Tribromomethane -2.13 -2.334
129 Bromoethane -0.7 -0.9601
130 1,2-Dibromoethane -2.1 -2.638
131 1-Bromopropane -0.56 -1.0035
132 2-Bromopropane -0.48 -1.109
133 1,2-Dibromopropane -1.94 -0.06
134 1,3-Dibromopropane -1.96 -1.658
135 1-Bromo-2-methylpropane -0.03 -0.3031
136 1-Bromobutane -0.41 -0.8639
137 1-Bromoisobutane -0.03 0.353
138 1-Bromo-3-methylbutane 0.21 -0.8331
139 1-Bromopentane -0.08 -0.77535
140 3-Bromopropene -0.86 -1.2995
141 Bromobenzene -1.46 -2.0111
142 1,4-Dibromobenzene -2.3 -1.3769
143 p-Bromotoluene -1.39 -1.9557
144 1-Bromo-2-ethylbenzene -1.19 -0.296
145 o-Bromocumene -0.85 -1.6731
146 Methanol -5.11 -4.736
147 Ethanol -5.01 -4.786
148 Ethylene glycol -9.3 -6.5
149 1-Propanol -4.83 -4.448
150 2-Propanol -4.76 -4.731
151 1,1,1-Trifluoro-2-propanol -4.16 -3.172
152 2,2,3,3-Tetrafluoropropanol -4.88 -2.869
153 2,2,3,3,3-Pentafluoropropanol -4.15 -4.318
154 Hexafluoro-2-propanol -3.76 -1.554
155 2-Methyl-1-propanol -4.52 -3.636
156 1-Butanol -4.72 -4.221
157 2-Butanol -4.58 -2.0371
158 t-Butyl alcohol -4.51 -4.69
159 2-Methyl-1-butanol -4.42 -1.9162
160 3-Methyl-1-butanol -4.42 -3.829
161 2-Methyl-2-butanol -4.43 -2.553
162 2,3-Dimethyl-1-butanol -3.91 -2.762
163 1-Pentanol -4.47 -4.036
164 2-Pentanol -4.39 -2.618
165 3-Pentanol -4.35 -3.478
166 2-Methyl-1-pentanol -3.93 -2.876
167 2-Methyl-2-pentanol -3.93 -2.1991
168 2-Methyl-3-pentanol -3.89 -2.548
169 4-Methyl-2-pentanol -3.74 -2.2558
170 Cyclopentanol -5.49 -4.408
171 1-Hexanol -4.36 -3.762
172 3-Hexanol -4.07 -3.066
173 Cyclohexanol -5.48 -4.05
174 4-Heptanol -4.01 -2.889
175 Cycloheptanol -5.49 -3.682
176 1-Heptanol -4.24 -3.79
177 1-Octanol -4.09 -3.715
178 Allyl alcohol -5.03 -5.25
179 Phenol -6.62 -5.426
180 4-Bromophenol -7.13 -5.649
181 4-t-Butylphenol -5.92 -4.062
182 2-Cresol -5.87 -5.034
183 3-Cresol -5.49 -5.349
184 4-Cresol -6.14 -5.328
185 2,2,2-Trifluoroethanol -4.31 -5.638
186 p-Bromophenol -7.13 -5.649
187 2-Methoxyethanol -6.77 -5.193
188 Dimethoxymethane -2.93 -2.0455
189 Methyl propyl ether -1.66 -3.146
190 Methyl isopropyl ether -2.01 -2.1025
191 Methyl t-butyl ether -2.21 -1.0785
192 Diethyl ether -1.76 -3.473
193 Ethyl propyl ether -1.81 -3.185
194 Dipropyl ether -1.16 -2.874
195 Diisopropyl ether -0.53 -1.1811
196 Di-n-butyl ether -0.83 -2.411
197 Tetrahydrofuran -3.47 -3.974
198 2-Methyltetrahydrofuran -3.3 -3.599
199 Anisole -2.45 -3.789
200 Ethyl phenyl ether -4.28 -3.857
201 1,1-Diethoxyethane -3.27 -3.583
202 1,2-Dimethoxyethane -4.83 -3.667
203 1,2-Diethoxyethane -3.52 -3.334
204 1,3-Dioxolane -4.1 -3.522
205 1,4-Dioxane -5.05 -5.707
206 2,2,2-Trifluoroethyl vinyl ether -0.12 -0.26
207 1-Chloro-2,2,2-trifluoroethyl difluoromethyl ether 0.11 -1.247
208 Acetaldehyde -3.5 -3.8434
209 Propanal -3.44 -3.4101
210 Butanal -3.18 -3.731
211 Pentanal -3.03 -3.2441
212 Hexanal -2.81 -3.5882
213 Heptanal -2.67 -2.9576
214 Octanal -2.29 -3.3372
215 Nonanal -2.08 -2.7208
216 trans-2-Butenal -4.23 -4.946
217 trans-2-Hexenal -3.68 -4.271
218 trans-2-Octenal -3.44 -3.6682
219 trans,trans-2,4-Hexadienal -4.63 -5.096
220 Benzaldehyde -4.02 -4.703
221 m-Hydroxybenzaldehyde -9.51 -8.772
222 p-Hydroxybenzaldehyde -10.48 -9.164
223 Acetone -3.85 -4.258
224 2-Butanone -3.64 -4.269
225 3-Methyl-2-butanone -3.24 -3.992
226 3,3-Dimethylbutanone -2.89 -3.681
227 2-Pentanone -3.53 -3.763
228 3-Pentanone -3.41 -2.9221
229 4-Methyl-2-pentanone -3.06 -3.23
230 2,4-Dimethyl-3-pentanone -2.74 -2.22485
231 Cyclopentanone -4.68 -4.163
232 2-Hexanone -3.29 -3.6473
233 2-Heptanone -3.04 -3.2754
234 4-Heptanone -2.93 -2.7046
235 2-Octanone -2.88 -3.1327
236 2-Nonanone -2.49 -2.904
237 5-Nonanone -2.67 -4.155
238 2-Undecanone -2.16 -2.4185
239 Acetophenone -4.58 -5.172
240 Acetic acid -6.7 -6.535
241 Propionic acid -6.47 -6.747
242 Butyric acid -6.36 -6.021
243 Pentanoic acid -6.16 -6.02
244 Hexanoic acid -6.21 -5.607
245 4-Amino-3,5,6-trichloropyridine-2-carboxylic acid -11.96 -10.202
246 Methyl formate -2.78 -3.62848
247 Ethyl formate -2.65 -2.8942
248 Propyl formate -2.48 -2.6296
249 Methyl acetate -3.31 -3.40406
250 Isopropyl formate -2.02 -3.34648
251 Isobutyl formate -2.22 -2.8746
252 Isoamyl formate -2.13 -2.5209
253 Ethyl acetate -3.1 -3.43298
254 Propyl acetate -2.86 -4.0639
255 Isopropyl acetate -2.65 -3.9699
256 Butyl acetate -2.55 -3.7066
257 Isobutyl acetate -2.36 -3.2861
258 Amyl acetate -2.45 -3.5504
259 Isoamyl acetate -2.21 -3.5564
260 Hexyl acetate -2.26 -3.39161
261 Methyl propionate -2.93 -3.9529
262 Ethyl propionate -2.8 -3.8199
263 Propyl propionate -2.54 -3.4933
264 Isopropyl propionate -2.22 -3.3816
265 Amyl propionate -1.99 -2.95397
266 Methyl butyrate -2.83 -3.5128
267 Ethyl butyrate -2.5 -3.2304
268 Propyl butyrate -2.28 -3.0632
269 Methyl pentanoate -2.57 -3.6008
270 Ethyl pentanoate -2.52 -3.3553
271 Methyl hexanoate -2.49 -3.06611
272 Ethyl heptanoate -2.3 -3.01596
273 Methyl octanoate -2.04 -2.5765
274 Methyl benzoate -4.28 -7.585
275 Methylamine -4.56 -4.88
276 Ethylamine -4.5 -3.977
277 Propylamine -4.39 -3.957
278 Butylamine -4.29 -3.712
279 Pentylamine -4.1 -3.445
280 Hexylamine -4.03 -3.274
281 Dimethylamine -4.29 -3.78
282 Diethylamine -4.07 -3.312
283 Dipropylamine -3.66 -2.728
284 Dibutylamine -3.31 -2.507
285 Trimethylamine -3.23 -3.457
286 Triethylamine -3.03 -2.717
287 Azetidine -5.56 -3.865
288 Piperazine -7.38 -9.97
289 N,N'-Dimethylpiperazine -7.58 -8.016
290 N-Methylpiperazine -7.77 -6.872
291 Aniline -5.49 -6.56
292 1,1-Dimethyl-3-phenylurea -9.63 -10.28
293 N,N-Dimethylaniline -3.58 -6.278
294 Ethylenediamine -9.75 -9.839
295 Hydrazine -9.3 -8.355
296 2-Methoxy-1-ethanamine -6.55 -6.846
297 Morpholine -7.17 -6.925
298 N-Methylmorpholine -6.34 -5.822
299 N-Methylpyrrolidine -3.98 -3.829
300 N-Methylpiperidine -3.89 -3.572
301 Pyrrolidine -5.48 -3.111
302 Piperidine -5.11 -4.442
303 Pyridine -4.7 -4.119
304 2-Methylpyridine -4.63 -3.518
305 3-Methylpyridine -4.77 -3.562
306 4-Methylpyridine -4.94 -4.049
307 2-Ethylpyridine -4.33 -3.0412
308 3-Ethylpyridine -4.6 -3.34
309 4-Ethylpyridine -4.74 -3.832
310 2,3-Dimethylpyridine -4.83 -3.4115
311 2,4-Dimethylpyridine -4.86 -3.5107
312 2,5-Dimethylpyridine -4.72 -3.4217
313 2,6-Dimethylpyridine -4.6 -3.3403
314 3,4-Dimethylpyridine -5.22 -3.643
315 3,5-Dimethylpyridine -4.84 -3.943
316 2-Methylpyrazine -5.52 -5.398
317 2-Ethylpyrazine -5.46 -5.142
318 2-Isobutylpyrazine -5.05 -4.393
319 2-Ethyl-3-methoxypyrazine -4.4 -5.577
320 2-Isobutyl-3-methoxypyrazine -3.68 -2.9336
321 9-Methyladenine -13.6 -13.59
322 1-Methylthymine -10.4 -10.303
323 Methylimidazole -10.25 -8.597
324 N-Propylguanidine -10.92 -10.564
325 Acetonitrile -3.89 -2.2333
326 Propionitrile -3.85 -2.23622
327 Butyronitrile -3.64 -2.1477
328 Benzonitrile -4.1 -4.308
329 2,6-Dichlorobenzonitrile -5.22 -5.581
330 3,5-Dibromo-4-hydroxybenzonitrile -9 -4.716
331 N,N-Dimethylformamide -4.9 -5.3387
332 N-Methylformamide -10 -9.728
333 Acetamide -9.71 -10.934
334 (E)-N-Methylacetamide -10 -8.721
335 (Z)-N-Methylacetamide -10 -9.189
336 Propionamide -9.42 -10.428
337 Methanethiol -1.24 -1.889
338 Ethanethiol -1.3 -1.5487
339 1-Propanethiol -1.05 -1.4264
340 Thiophenol -2.55 -3.274
341 Thioanisole -2.73 -2.984
342 Dimethyl sulfide -1.54 -1.821
343 Diethyl sulfide -1.43 -1.143
344 Methyl ethyl sulfide -1.49 -1.4987
345 Dipropyl sulfide -1.27 -1.1159
346 2,2'-Dichlorodiethyl sulfide -3.92 -2.6346
347 Dimethyl disulfide -1.83 -3.355
348 Diethyl disulfide -1.63 -2.575
349 Trimethyl phosphate -8.7 -7.6882
350 Triethyl phosphate -7.8 -7.6335
351 Tripropyl phosphate -6.1 -5.865
352 2,2-Dichloroethenyl dimethyl phosphate -6.61 -5.229
353 Dimethyl-5-(4-chloro)-bicyclo[3.2.0]-heptyl phosphate -7.28 -5.993
354 o-Ethyl-o'-(4-bromo-2-chlorophenyl) S-propyl phosphorothioate -4.09 -6.316
355 Hydroquinone -10.77 -9.466
356 1,2,3-Trimethoxybenzene -5.4 -3.8749
357 1,2-Benzenediol -7.62 -8.649
358 1,3-Benzenediol -9.67 -9.133
359 o-Phenylenediamine -7.19 -10.124
360 m-Phenylenediamine -10.26 -12.101
361 2-Methylaniline -5.56 -6.495
362 N-Methylaniline -4.68 -5.918
363 Acetylene anion -73 -73.669
364 Protonated methanol -87 -83.049
365 Protonated dimethyl ether -70 -69.636
366 Protonated 2-propanol -64 -62.664
367 Oxonium ion -105 -104.865
368 Methanolate ion -98 -97.655
369 Formylate ion -77 -79.832
370 Dimethyl ether carbanion -81 -75.585
371 Phenolate ion -75 -77.695
372 Toluene carbanion -59 -59.715
373 Hydrogen peroxide anion -101 -98.014
374 Methyl ammonium ion -73 -71.676
375 Protonated N-Methylmethanamine -66 -64.57
376 Protonated N,N-Dimethylmethanamine -59 -58.822
377 Pyridinium ion -58 -60.615
378 Ammonium ion -81 -85.215
379 Acetonitrile carbanion -75 -73.352
380 Amide ion -95 -97.509
381 Azide ion -74 -76.391
382 Methylsulfonium ion -74 -72.051
383 Protonated Dimethyl sulfide -61 -64.492
384 1-Propanethiolate anion -76 -76.615
385 Thiophenolate ion -65 -64.2845
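
As a rough illustration of how the Exp and Calc columns of Table A2 can be compared, the following minimal sketch computes the mean signed error, the mean unsigned error and the root-mean-square error for the first five rows of the table. The helper name `error_stats` and the choice of these three statistics are assumptions made for illustration only; they are not taken from the thesis.

```python
import math

# First five rows of Table A2: (compound, experimental, calculated), in kcal/mol.
rows = [
    ("Methane", 2.0, 0.8524),
    ("Ethane", 1.83, 0.981),
    ("Propane", 1.96, 1.259),
    ("Cyclopropane", 0.75, 0.955),
    ("2-Methylpropane", 2.32, 1.645),
]


def error_stats(pairs):
    """Hypothetical helper: return (mean signed error, mean unsigned error, RMSE)
    for an iterable of (experimental, calculated) pairs."""
    errors = [calc - exp for exp, calc in pairs]
    n = len(errors)
    mean_signed = sum(errors) / n
    mean_unsigned = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    return mean_signed, mean_unsigned, rmse


mean_signed, mue, rmse = error_stats([(exp, calc) for _, exp, calc in rows])
print(f"mean signed error = {mean_signed:.3f} kcal/mol")
print(f"mean unsigned error = {mue:.3f} kcal/mol")
print(f"RMSE = {rmse:.3f} kcal/mol")
```

Passing all rows of a table instead of the first five gives the corresponding statistics for the full data set; the neutral compounds and the ions are probably best evaluated separately, since their solvation free energies differ by more than an order of magnitude.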


Table A3. )( 2OHGsolvΔ (kcal mol 1− ) for the neutral compounds calculated with AM1 and the iso

No Compound Exp Calc 1 Methane 2 0.80712 Ethane 1.83 0.9013 Propane 1.96 1.1374 Cyclopropane 0.75 0.9145 2-Methylpropane 2.32 1.4756 2,2-Dimethylpropane 2.5 1.927 n-Butane 2.08 1.368 2,2-Dimethylbutane 2.59 2.0679 Cyclopentane 1.2 0.632810 n-Pentane 2.33 1.57111 2-Methylpentane 2.52 1.88712 3-Methylpentane 2.51 1.84313 2,4-Dimethylpentane 2.88 2.22214 2,2,4-Trimethylpentane 2.85 2.60815 Methylcyclopentane 1.6 1.0216 n-Hexane 2.49 1.79717 Cyclohexane 1.23 1.38518 Methylcyclohexane 1.71 1.73119 cis-1,2-Dimethylcyclohexane 1.58 1.97520 n-Heptane 2.62 2.01621 n-Octane 2.89 2.21522 Ethylene 1.27 0.609423 Propylene 1.27 0.71424 2-Methylpropene 1.16 0.82225 1-Butene 1.38 0.86726 2-Methyl-2-butene 1.31 0.95827 3-Methyl-1-butene 1.83 1.2128 1-Pentene 1.66 1.09429 trans-2-Pentene 1.34 0.97230 4-Methyl-1-pentene 1.91 1.51631 Cyclopentene 0.56 0.181332 1-Hexene 1.68 1.32933 Cyclohexene 0.37 0.56534 trans-2-Heptene 1.66 1.40535 1-Methylcyclohexene 0.67 0.70936 1-Octene 2.17 1.74937 1,3-Butadiene 0.61 0.245338 2-Methyl-1,3-butadiene 0.68 0.43139 2,3-Dimethyl-1,3-butadiene 0.4 0.55240 1,4-Pentadiene 0.94 0.79341 1,5-Hexadiene 1.01 0.95342 Acetylene -0.01 -0.136743 Propyne -0.31 -0.31644 1-Butyne -0.16 -0.054

130

Page 145: 3D-QSAR/QSPR Based Surface- Dependent Modeling Approach

Appendix

45 1-Pentyne 0.01 0.21946 1-Hexyne 0.29 0.40547 1-Heptyne 0.6 0.63448 1-Octyne 0.71 0.84649 1-Nonyne 1.05 1.04250 Butenyne 0.04 -0.611951 Benzene -0.87 -1.215752 Toluene -0.89 -1.021953 1,2,4-Trimethylbenzene -0.86 -0.70354 Ethylbenzene -0.8 -0.663355 m-Xylene -0.84 -0.885156 o-Xylene -0.9 -0.846757 p-Xylene -0.8 -0.871158 Propylbenzene -0.53 -0.39659 Butylbenzene -0.4 -0.16260 t-Butylbenzene -0.44 0.00861 t-Amylbenzene -0.18 0.1862 Naphtalene -2.39 -2.823563 Anthracene -4.23 -4.35264 Phenanthrene -3.95 -4.3265 Acenaphtene -3.15 -3.206766 p-Chlorotoluene -1.92 -2.060467 Fluoromethane -0.22 -2.164968 1,1-Difluoroethane -0.11 -0.98969 Trifluoromethane -0.81 -1.07470 Tetrafluoromethane 3.11 1.66971 Hexafluoroethane 3.94 2.92172 Octafluoropropane 4.28 4.36273 Fluorobenzene -0.78 -2.6774 2-Chloro-1,1,1-trifluoroethane 0.05 -0.27875 Chlorofluoromethane -0.77 -0.80276 Chlorodifluoromethane -0.5 0.26277 Chlorotrifluoromethane 2.52 2.62278 Dichlorodifluoromethane 1.69 2.01879 Fluorotrichloromethane 0.82 0.642680 1,1,2-Trichloro-1,2,2-trifluoroethane 1.77 3.27281 1,1,2,2-Tetrachlorodifluoroethane 0.82 1.11182 Chloropentafluoroethane 2.87 3.45683 1,1-Dichlorotetrafluoroethane 2.51 2.21184 1,2-Dichlorotetrafluoroethane 2.32 3.91585 1-Bromo-1-chloro-2,2,2-trifluoroethane -0.13 -0.482186 Bromotrifluoromethane 1.79 0.81587 1-Bromo-1,2,2,2-tetrafluoroethane 0.52 0.34188 Chloromethane -0.56 -1.187889 Dichloromethane -1.41 -0.641190 Trichloromethane -1.07 -0.632691 Tetrachloromethane 0.1 -0.999

131

Page 146: 3D-QSAR/QSPR Based Surface- Dependent Modeling Approach

Appendix

92 Chloroethane -0.63 -1.194293 1,1-Dichloroethane -0.85 -0.933994 (E)-1,2-Dichloroethane -1.73 -1.575795 1,1,1-Trichloroethane -0.25 -1.259796 1,1,2-Trichloroethane -1.95 -1.174697 1,1,1,2-Tetrachloroethane -1.15 -1.058798 1,1,2,2-Tetrachloroethane -2.36 -0.25799 Pentachloroethane -1.36 -1.0926100 Hexachloroethane -1.41 -0.929101 1-Chloropropane -0.27 -0.98153102 2-Chloropropane -0.25 0.321103 1,2-Dichloropropane -1.25 -0.7996104 1,3-Dichloropropane -1.9 -1.60189105 1-Chlorobutane -0.14 -0.84263106 2-Chlorobutane 0.07 0.511107 1,1-Dichlorobutane -0.7 -0.6147108 1-Chloropentane -0.07 -0.6884109 2-Chloropentane 0.07 0.708110 3-Chloropentane 0.04 -0.4214111 Chloroethylene -0.59 -0.594112 cis-1,2-Dichloroethylene -1.17 -0.8462113 trans-1,2-Dichloroethylene -0.76 -0.8634114 Trichloroethylene -0.44 -1.1249115 Tetrachloroethylene 0.05 -1.429116 Chlorobenzene -1.12 -2.0034117 o-Chlorotoluene -1.15 -1.3306118 1,2-Dichlorobenzene -1.36 -2.2624119 1,3-Dichlorobenzene -0.98 -2.2158120 1,4-Dichlorobenzene -1.01 -2.548121 2,2'-Dichlorobiphenyl -2.73 -2.51954122 2,3-Dichlorobiphenyl -2.45 -3.2246123 2,2,3'-Trichlorobiphenyl -1.99 -3.1211124 Bromotrichloromethane -0.93 -1.661125 1-Chloro-2-bromoethane -1.95 -2.1126 Bromomethane -0.82 0.0297127 Dibromomethane -2.11 -1.646128 Tribromomethane -2.13 -2.796129 Bromoethane -0.7 -0.80407130 1,2-Dibromoethane -2.1 -1.998131 1-Bromopropane -0.56 -0.57778132 2-Bromopropane -0.48 1.318133 1,2-Dibromopropane -1.94 -1.3061134 1,3-Dibromopropane -1.96 -1.8931135 1-Bromo-2-methylpropane -0.03 1.136136 1-Bromobutane -0.41 -0.5448137 1-Bromoisobutane -0.03 -0.1121138 1-Bromo-3-methylbutane 0.21 -0.1958

132

Page 147: 3D-QSAR/QSPR Based Surface- Dependent Modeling Approach

Appendix

139 1-Bromopentane -0.08 -0.4140 3-Bromopropene -0.86 -0.4982141 Bromobenzene -1.46 -1.9521142 1,4-Dibromobenzene -2.3 -1.3538143 p-Bromotoluene -1.39 -2.1592144 1-Bromo-2-ethylbenzene -1.19 0.492145 o-Bromocumene -0.85 -0.6293146 Methanol -5.11 -4.897147 Ethanol -5.01 -5.078148 Ethylene glycol -9.3 -6.527149 1-Propanol -4.83 -4.73150 2-Propanol -4.76 -4.893151 1,1,1-Trifluoro-2-propanol -4.16 -3.223152 2,2,3,3-Tetrafluoropropanol -4.88 -3.7019153 2,2,3,3,3-Pentafluoropropanol -4.15 -5.353154 Hexafluoro-2-propanol -3.76 -1.908155 2-Methyl-1-propanol -4.52 -4.372156 1-Butanol -4.72 -4.502157 2-Butanol -4.58 -2.1465158 t-Butyl alcohol -4.51 -4.798159 2-Methyl-1-butanol -4.42 -2.2947160 3-Methyl-1-butanol -4.42 -4.064161 2-Methyl-2-butanol -4.43 -2.759162 2,3-Dimethyl-1-butanol -3.91 -3.582163 1-Pentanol -4.47 -4.336164 2-Pentanol -4.39 -2.595165 3-Pentanol -4.35 -3.785166 2-Methyl-1-pentanol -3.93 -3.687167 2-Methyl-2-pentanol -3.93 -2.391168 2-Methyl-3-pentanol -3.89 -2.893169 4-Methyl-2-pentanol -3.74 -2.281170 Cyclopentanol -5.49 -4.634171 1-Hexanol -4.36 -4.053172 3-Hexanol -4.07 -3.367173 Cyclohexanol -5.48 -4.373174 4-Heptanol -4.01 -3.216175 Cycloheptanol -5.49 -3.883176 1-Heptanol -4.24 -4.176177 1-Octanol -4.09 -4.098178 Allyl alcohol -5.03 -5.842179 Phenol -6.62 -5.857180 4-Bromophenol -7.13 -6.401181 4-t-Butylphenol -5.92 -4.125182 2-Cresol -5.87 -4.862183 3-Cresol -5.49 -5.797184 4-Cresol -6.14 -5.809185 2,2,2-Trifluoroethanol -4.31 -6.47

133

Page 148: 3D-QSAR/QSPR Based Surface- Dependent Modeling Approach

Appendix

186 p-Bromophenol -7.13 -6.401187 2-Methoxyethanol -6.77 -4.719188 Dimethoxymethane -2.93 -1.5904189 Methyl propyl ether -1.66 -2.8190 Methyl isopropyl ether -2.01 -1.644191 Methyl t-butyl ether -2.21 -0.6949192 Diethyl ether -1.76 -3.236193 Ethyl propyl ether -1.81 -2.925194 Dipropyl ether -1.16 -2.579195 Diisopropyl ether -0.53 -1.0426196 Di-n-butyl ether -0.83 -2.1359197 Tetrahydrofuran -3.47 -3.734198 2-Methyltetrahydrofuran -3.3 -3.432199 Anisole -2.45 -3.1637200 Ethyl phenyl ether -4.28 -3.1649201 1,1-Diethoxyethane -3.27 -2.8988202 1,2-Dimethoxyethane -4.83 -2.8075203 1,2-Diethoxyethane -3.52 -2.6382204 1,3-Dioxolane -4.1 -3.122205 1,4-Dioxane -5.05 -4.876206 2,2,2-Trifluoroethyl vinyl ether -0.12 -0.35207 1-Chloro-2,2,2-trifluoroethyl difluoromethyl ether 0.11 -1.363208 Acetaldehyde -3.5 -3.7743209 Propanal -3.44 -2.949210 Butanal -3.18 -3.2128211 Pentanal -3.03 -2.57359212 Hexanal -2.81 -2.9175213 Heptanal -2.67 -2.2493214 Octanal -2.29 -2.5851215 Nonanal -2.08 -1.9666216 trans-2-Butenal -4.23 -4.596217 trans-2-Hexenal -3.68 -3.5553218 trans-2-Octenal -3.44 -3.0859219 trans,trans-2,4-Hexadienal -4.63 -4.2269220 Benzaldehyde -4.02 -4.391221 m-Hydroxybenzaldehyde -9.51 -9.066222 p-Hydroxybenzaldehyde -10.48 -9.198223 Acetone -3.85 -4.074224 2-Butanone -3.64 -4.234225 3-Methyl-2-butanone -3.24 -3.797226 3,3-Dimethylbutanone -2.89 -3.3775227 2-Pentanone -3.53 -3.69228 3-Pentanone -3.41 -3.1188229 4-Methyl-2-pentanone -3.06 -3.3107230 2,4-Dimethyl-3-pentanone -2.74 -2.20966231 Cyclopentanone -4.68 -3.966232 2-Hexanone -3.29 -3.6411

134

Page 149: 3D-QSAR/QSPR Based Surface- Dependent Modeling Approach

Appendix

233 2-Heptanone -3.04 -3.2661234 4-Heptanone -2.93 -2.7957235 2-Octanone -2.88 -3.1127236 2-Nonanone -2.49 -2.929237 5-Nonanone -2.67 -4.143238 2-Undecanone -2.16 -2.4784239 Acetophenone -4.58 -4.974240 Acetic acid -6.7 -6.446241 Propionic acid -6.47 -6.731242 Butyric acid -6.36 -5.924243 Pentanoic acid -6.16 -5.945244 Hexanoic acid -6.21 -5.513245 4-Amino-3,5,6-trichloropyridine-2-carboxylic acid -11.96 -10.62246 Methyl formate -2.78 -3.9017247 Ethyl formate -2.65 -3.0931248 Propyl formate -2.48 -2.7386249 Methyl acetate -3.31 -3.39886250 Isopropyl formate -2.02 -3.41405251 Isobutyl formate -2.22 -2.8939252 Isoamyl formate -2.13 -2.379253 Ethyl acetate -3.1 -3.38374254 Propyl acetate -2.86 -3.8644255 Isopropyl acetate -2.65 -4.012256 Butyl acetate -2.55 -3.5154257 Isobutyl acetate -2.36 -3.741258 Amyl acetate -2.45 -3.3177259 Isoamyl acetate -2.21 -3.3588260 Hexyl acetate -2.26 -3.1331261 Methyl propionate -2.93 -3.9138262 Ethyl propionate -2.8 -3.7573263 Propyl propionate -2.54 -3.3838264 Isopropyl propionate -2.22 -3.4825265 Amyl propionate -1.99 -2.8459266 Methyl butyrate -2.83 -3.3729267 Ethyl butyrate -2.5 -3.0897268 Propyl butyrate -2.28 -2.88715269 Methyl pentanoate -2.57 -3.4129270 Ethyl pentanoate -2.52 -3.1943271 Methyl hexanoate -2.49 -2.9004272 Ethyl heptanoate -2.3 -2.8331273 Methyl octanoate -2.04 -2.403274 Methyl benzoate -4.28 -6.3513275 Methylamine -4.56 -4.84276 Ethylamine -4.5 -4.099277 Propylamine -4.39 -4.149278 Butylamine -4.29 -3.904279 Pentylamine -4.1 -3.673

135

Page 150: 3D-QSAR/QSPR Based Surface- Dependent Modeling Approach

Appendix

280 Hexylamine -4.03 -3.5281 Dimethylamine -4.29 -3.548282 Diethylamine -4.07 -3.237283 Dipropylamine -3.66 -2.652284 Dibutylamine -3.31 -2.439285 Trimethylamine -3.23 -3.177286 Triethylamine -3.03 -3.018287 Azetidine -5.56 -4.231288 Piperazine -7.38 -10.238289 N,N'-Dimethylpiperazine -7.58 -7.921290 N-Methylpiperazine -7.77 -6.94291 Aniline -5.49 -6.306292 1,1-Dimethyl-3-phenylurea -9.63 -10.59293 N,N-Dimethylaniline -3.58 -5.564294 Ethylenediamine -9.75 -10.117295 Hydrazine -9.3 -9.103296 2-Methoxy-1-ethanamine -6.55 -6.739297 Morpholine -7.17 -6.857298 N-Methylmorpholine -6.34 -5.702299 N-Methylpyrrolidine -3.98 -3.865300 N-Methylpiperidine -3.89 -3.585301 Pyrrolidine -5.48 -3.34302 Piperidine -5.11 -4.614303 Pyridine -4.7 -4.052304 2-Methylpyridine -4.63 -3.716305 3-Methylpyridine -4.77 -3.713306 4-Methylpyridine -4.94 -4.176307 2-Ethylpyridine -4.33 -3.4308 3-Ethylpyridine -4.6 -3.509309 4-Ethylpyridine -4.74 -3.96310 2,3-Dimethylpyridine -4.83 -3.678311 2,4-Dimethylpyridine -4.86 -3.819312 2,5-Dimethylpyridine -4.72 -3.76313 2,6-Dimethylpyridine -4.6 -3.677314 3,4-Dimethylpyridine -5.22 -3.809315 3,5-Dimethylpyridine -4.84 -4.172316 2-Methylpyrazine -5.52 -4.656317 2-Ethylpyrazine -5.46 -4.449318 2-Isobutylpyrazine -5.05 -3.628319 2-Ethyl-3-methoxypyrazine -4.4 -5.173320 2-Isobutyl-3-methoxypyrazine -3.68 -2.9403321 9-Methyladenine -13.6 -14.007322 1-Methylthymine -10.4 -10.295323 Methylimidazole -10.25 -8.87324 N-Propylguanidine -10.92 -10.18325 Acetonitrile -3.89 -2.5948326 Propionitrile -3.85 -2.4743

327 Butyronitrile -3.64 -2.2009
328 Benzonitrile -4.1 -4.117
329 2,6-Dichlorobenzonitrile -5.22 -4.563
330 3,5-Dibromo-4-hydroxybenzonitrile -9 -8.614
331 N,N-Dimethylformamide -4.9 -5.3883
332 N-Methylformamide -10 -9.26
333 Acetamide -9.71 -11.07
334 (E)-N-Methylacetamide -10 -8.562
335 (Z)-N-Methylacetamide -10 -9.602
336 Propionamide -9.42 -10.807
337 Methanethiol -1.24 -1.787
338 Ethanethiol -1.3 -1.5067
339 1-Propanethiol -1.05 -1.3026
340 Thiophenol -2.55 -3.11
341 Thioanisole -2.73 -2.886
342 Dimethyl sulfide -1.54 -1.786
343 Diethyl sulfide -1.43 -1.3491
344 Methyl ethyl sulfide -1.49 -1.676
345 Dipropyl sulfide -1.27 -1.0457
346 2,2'-Dichlorodiethyl sulfide -3.92 -3.214
347 Dimethyl disulfide -1.83 -2.83
348 Diethyl disulfide -1.63 -1.8783
349 Trimethyl phosphate -8.7 -10.507
350 Triethyl phosphate -7.8 -9.063
351 Tripropyl phosphate -6.1 -4.806
352 2,2-Dichloroethenyl dimethyl phosphate -6.61 -4.648
353 Dimethyl-5-(4-chloro)-bicyclo[3.2.0]-heptyl phosphate -7.28 -6.7174
354 o-Ethyl-o'-(4-bromo-2-chlorophenyl) S-propyl phosphorothioate -4.09 -5.69
355 Hydroquinone -10.77 -10.377
356 1,2,3-Trimethoxybenzene -5.4 -3.5128
357 1,2-Benzenediol -7.62 -9.02
358 1,3-Benzenediol -9.67 -9.45
359 o-Phenylenediamine -7.19 -9.892
360 m-Phenylenediamine -10.26 -11.444
361 2-Methylaniline -5.56 -6.083
362 N-Methylaniline -4.68 -5.339

Table A4. $\Delta G_{\mathrm{solv}}(\mathrm{octanol})$ (kcal mol$^{-1}$) calculated with PM3 and the iso
No Compound Exp Calc
1 Methane 0.51 0.00188
2 Ethane -0.64 -0.4052
3 Propane -1.26 -1.092
4 Cyclopropane -1.6 -0.5115
5 2-Methylpropane -1.45 -1.689
6 2,2-Dimethylpropane -1.74 -2.095
7 n-Butane -1.86 -1.736
8 Cyclopentane -2.65 -2.132
9 n-Pentane -2.45 -2.408
10 n-Hexane -3.01 -3.065
11 Cyclohexane -3.46 -2.782
12 Methylcyclohexane -3.21 -3.206
13 n-Heptane -3.74 -3.738
14 n-Octane -4.18 -4.38
15 Ethylene -0.27 -0.5058
16 Propylene -1.14 -1.1119
17 2-Methylpropene -2.03 -1.669
18 1-Butene -1.89 -1.93
19 1-Hexene -2.94 -3.3
20 1,3-Butadiene -2.1 -2.402
21 Acetylene -0.51 -1.5961
22 Propyne -1.59 -2.203
23 1-Pentyne -2.79 -3.357
24 1-Hexyne -3.43 -4.003
25 Benzene -3.72 -3.802
26 Toluene -4.55 -4.292
27 Ethylbenzene -5.08 -4.844
28 m-Xylene -5.25 -4.782
29 o-Xylene -5.07 -4.691
30 p-Xylene -5.19 -4.775
31 Naphthalene -6.97 -7.246
32 Anthracene -10.47 -10.881
33 1,1-Difluoroethane -1.13 -0.9462
34 Tetrafluoromethane 1.5 1.476
35 Fluorobenzene -3.87 -4.381
36 Chlorodifluoromethane -1.97 -2.56
37 Dichlorodifluoromethane -1.25 -0.6303
38 Fluorotrichloromethane -2.63 -3.281
39 1,1,2-Trichloro-1,2,2-trifluoroethane -2.54 -2.871
40 1-Bromo-1-chloro-2,2,2-trifluoroethane -3.27 -4.07
41 Bromotrifluoromethane -0.75 -0.34358
42 Dichloromethane -3.07 -3.162
43 Trichloromethane -3.81 -4.555

44 Chloroethane -2.58 -3.385
45 1,1,1-Trichloroethane -3.69 -3.98
46 1,1,2-Trichloroethane -4.53 -3.746
47 1-Chloropropane -3.06 -3.903
48 2-Chloropropane -2.84 -2.093
49 cis-1,2-Dichloroethylene -3.71 -3.092
50 trans-1,2-Dichloroethylene -3.61 -2.543
51 Trichloroethylene -3.75 -3.116
52 Tetrachloroethylene -4.24 -3.844
53 Chlorobenzene -5 -5.555
54 1,2-Dichlorobenzene -6.01 -5.781
55 1,4-Dichlorobenzene -5.67 -6.16
56 2,2'-Dichlorobiphenyl -9.41 -8.99
57 2,3-Dichlorobiphenyl -9.23 -9.932
58 2,2,3'-Trichlorobiphenyl -9.12 -10.174
59 Bromomethane -2.43 -2.562
60 Dibromomethane -4.18 -3.92
61 Tribromomethane -5.62 -5.433
62 Bromoethane -2.9 -2.63
63 1-Bromopropane -3.42 -2.943
64 2-Bromopropane -3.4 -3.464
65 1-Bromobutane -4.16 -3.881
66 1-Bromopentane -4.68 -4.59
67 3-Bromopropene -3.3 -3.546
68 Bromobenzene -5.46 -6.563
69 1,4-Dibromobenzene -7.47 -7.553
70 p-Bromotoluene -6.36 -7.149
71 Methanol -3.87 -4.24
72 Ethanol -4.36 -4.729
73 Ethylene glycol -7.44 -7.7
74 1-Propanol -5.02 -5.282
75 2-Propanol -4.62 -4.892
76 1,1,1-Trifluoro-2-propanol -5.12 -4.713
77 Hexafluoro-2-propanol -5.76 -4.801
78 2-Methyl-1-propanol -4.78 -5.609
79 1-Butanol -5.71 -5.85
80 1-Pentanol -6.4 -6.578
81 1-Hexanol -7.06 -7.092
82 1-Heptanol -7.75 -8.082
83 1-Octanol -8.13 -8.835
84 Allyl alcohol -5.27 -5.795
85 Phenol -8.69 -7.491
86 2-Cresol -8.49 -7.441
87 3-Cresol -8.2 -8.028
88 4-Cresol -8.84 -8.036

89 2,2,2-Trifluoroethanol -4.81 -6.601
90 p-Bromophenol -10.59 -10.177
91 2-Methoxyethanol -5.83 -5.976
92 Methyl propyl ether -3.63 -3.898
93 Methyl isopropyl ether -4.64 -3.569
94 Methyl t-butyl ether -3.49 -3.683
95 Diethyl ether -2.89 -3.757
96 Tetrahydrofuran -3.93 -3.744
97 Anisole -5.47 -5.517
98 Ethyl phenyl ether -5.65 -6.012
99 1,2-Dimethoxyethane -4.55 -4.826
100 1,4-Dioxane -4.89 -5.028
101 Propanal -4.13 -4.096
102 Butanal -4.62 -4.894
103 Benzaldehyde -6.13 -7.323
104 m-Hydroxybenzaldehyde -11.39 -11.059
105 p-Hydroxybenzaldehyde -12.36 -10.662
106 Acetone -3.15 -3.682
107 2-Butanone -3.78 -4.401
108 3,3-Dimethylbutanone -4.53 -4.773
109 2-Pentanone -4.35 -4.689
110 3-Pentanone -4.36 -5.15
111 Cyclopentanone -5.01 -4.553
112 2-Hexanone -5.02 -5.157
113 2-Heptanone -5.65 -5.834
114 2-Octanone -6.38 -6.445
115 Acetophenone -6.74 -7.294
116 Acetic acid -6.35 -6.093
117 Propionic acid -6.86 -5.789
118 Butyric acid -7.58 -6.361
119 Pentanoic acid -8.22 -6.883
120 Hexanoic acid -8.82 -7.478
121 4-Amino-3,5,6-trichloropyridine-2-carboxylic acid -12.37 -12.821
122 Methyl formate -2.82 -4.885
123 Methyl acetate -3.54 -4.449
124 Ethyl acetate -4.06 -4.732
125 Propyl acetate -4.55 -5.362
126 Butyl acetate -4.96 -5.922
127 Methyl propionate -4.06 -4.338
128 Methyl butyrate -4.59 -5.195
129 Methyl pentanoate -5.13 -5.512
130 Methyl benzoate -7.26 -9.43
131 Methylamine -3.78 -2.665
132 Ethylamine -4.09 -3.056
133 Propylamine -4.77 -3.817

134 Butylamine -5.35 -4.436
135 Diethylamine -4.75 -4.726
136 Dipropylamine -6.02 -5.969
137 Trimethylamine -3.6 -3.792
138 Piperazine -5.8 -7.24
139 Aniline -6.71 -6.509
140 1,1-Dimethyl-3-phenylurea -13.12 -10.481
141 Hydrazine -6.48 -4.205
142 Morpholine -5.99 -5.477
143 Piperidine -6.27 -5.154
144 Pyridine -5.34 -5.106
145 2-Methylpyridine -6.14 -5.572
146 3-Methylpyridine -6.4 -5.599
147 4-Methylpyridine -6.6 -5.783
148 2-Ethylpyridine -6.4 -6.122
149 2-Methylpyrazine -5.87 -5.344
150 2-Ethylpyrazine -6.4 -5.886
151 2-Ethyl-3-methoxypyrazine -6.85 -6.644
152 9-Methyladenine -13.56 -12.888
153 Acetonitrile -3.15 -3.5473
154 Propionitrile -3.66 -3.851
155 Butyronitrile -4.25 -4.228
156 Benzonitrile -6.09 -7.448
157 2,6-Dichlorobenzonitrile -9.18 -7.326
158 1-Propanethiol -3.52 -2.524
159 Thiophenol -5.99 -5.537
160 Thioanisole -6.47 -7.907
161 Diethyl sulfide -4.09 -4.015
162 Dipropyl sulfide -3.89 -5.332
163 Dimethyl disulfide -4.24 -4.001
164 Trimethyl phosphate -7.81 -8.173
165 Triethyl phosphate -8.88 -7.976
166 Tripropyl phosphate -8.65 -8.904
167 2,2-Dichloroethenyl dimethyl phosphate -8.59 -9.513
168 o-Ethyl-o'-(4-bromo-2-chlorophenyl) S-propyl phosphorothioate -10.49 -9.792

Table A5. $\Delta G_{\mathrm{solv}}(\mathrm{CHCl_3})$ (kcal mol$^{-1}$) calculated with AM1 and the iso
No Compound Exp Calc
1 Cyclohexane -4.45 -3.419
2 n-Octane -5.25 -4.99
3 Benzene -4.64 -4.588
4 Toluene -5.48 -5.281
5 Ethylbenzene -5.84 -5.858
6 m-Xylene -5.86 -6.029
7 o-Xylene -6.23 -5.921
8 Naphthalene -7.89 -8.172
9 Phenanthrene -10.9 -11.582
10 Fluorobenzene -4.25 -4.698
11 Dichlorodifluoromethane -1.55 -0.8468
12 Fluorotrichloromethane -2.62 -2.71
13 Chlorobenzene -5.45 -5.754
14 1,4-Dichlorobenzene -6.32 -6.767
15 Bromobenzene -6.07 -6.298
16 Methanol -3.32 -3.283
17 Ethanol -3.94 -4.018
18 Ethylene glycol -5.98 -4.636
19 1-Propanol -4.41 -4.396
20 2-Propanol -4.28 -4.544
21 2-Methyl-1-propanol -4.48 -5.095
22 1-Butanol -5.28 -5.009
23 1-Pentanol -5.9 -5.589
24 1-Hexanol -6.67 -6.142
25 1-Heptanol -7.53 -6.824
26 Allyl alcohol -4.34 -4.913
27 Phenol -7.14 -6.365
28 2-Cresol -7.55 -6.444
29 3-Cresol -6.7 -7.073
30 4-Cresol -7.59 -7.119
31 2,2,2-Trifluoroethanol -3.03 -4.1799
32 p-Bromophenol -8.59 -7.94
33 Diethyl ether -4.32 -5
34 Diisopropyl ether -3.78 -5.129
35 Anisole -6.24 -6.267
36 Ethyl phenyl ether -7.16 -7.028
37 1,4-Dioxane -6.21 -5.452
38 Acetaldehyde -3.65 -3.7784
39 Benzaldehyde -7.09 -7.047
40 p-Hydroxybenzaldehyde -10.3 -9.028
41 Acetone -4.42 -4.666
42 2-Butanone -5.43 -5.599
43 Acetophenone -7.81 -8.122
44 Acetic acid -4.74 -5.133
45 Propionic acid -5.37 -5.899

46 Butyric acid -5.99 -5.93
47 Pentanoic acid -6.61 -6.608
48 Hexanoic acid -7.51 -6.992
49 Methyl acetate -4.9 -4.767
50 Ethyl acetate -5.58 -5.449
51 Propyl acetate -6.35 -6.336
52 Butyl acetate -6.71 -6.744
53 Amyl acetate -7.36 -7.316
54 Methyl propionate -5.48 -5.484
55 Methyl pentanoate -6.68 -6.462
56 Methyl hexanoate -7.24 -6.911
57 Methyl benzoate -7.81 -9.103
58 Methylamine -3.17 -3.961
59 Ethylamine -4.02 -4.222
60 Propylamine -4.73 -4.96
61 Butylamine -5.35 -5.543
62 Dimethylamine -3.69 -4.317
63 Diethylamine -5.23 -5.7
64 Trimethylamine -3.9 -5.085
65 Aniline -7.34 -6.985
66 1,1-Dimethyl-3-phenylurea -13.64 -12.187
67 Hydrazine -7.46 -5.937
68 Morpholine -6.72 -6.598
69 Piperidine -6.37 -6.554
70 Pyridine -6.45 -6.232
71 2-Methylpyridine -6.98 -6.913
72 3-Methylpyridine -7.35 -6.707
73 4-Methylpyridine -7.5 -7.047
74 2,6-Dimethylpyridine -7.74 -7.809
75 2-Methylpyrazine -6.99 -7.196
76 2-Ethylpyrazine -7.72 -7.79
77 9-Methyladenine -12.51 -13.384
78 1-Methylthymine -9.71 -10.476
79 Acetonitrile -4.44 -3.509
80 Benzonitrile -7.22 -7.184
81 Acetamide -7.05 -8.17
82 Thiophenol -7.61 -7.253
83 Thioanisole -5.98 -7.656
84 Diethyl sulfide -6.4 -5.462
85 Trimethyl phosphate -9.74 -9.232
86 Triethyl phosphate -10.9 -11.085
87 Tripropyl phosphate -11.11 -11.38

Table A6. The theoretical logPow for small data set calculated with PM3 and the iso
No Compound Exp Calc
1 n-Nonane 5.65 5.9077
2 n-Undecane 6.54 7.12721
3 n-Dodecane 6.8 7.74502
4 n-Tridecane 7.56 8.35184
5 n-Tetradecane 8 8.9506
6 2,3-Dimethylbutane 3.85 3.7274
7 3,3-Dimethylheptane 5.19 5.55666
8 Hept-1-ene 3.99 4.0997
9 Non-1-ene 5.15 5.31994
10 Isobutene 2.34 2.04912
11 But-2-yne 1.46 1.37414
12 Nonan-1-ol 3.67 3.90915
13 Undecan-1-ol 4.72 5.10374
14 Dodecan-1-ol 5.13 5.72595
15 Tridecan-1-ol 5.82 6.3313
16 Tetradecan-1-ol 6.36 6.93739
17 Hexan-2-ol 1.76 1.6966
18 Heptan-2-ol 2.31 2.3364
19 Heptan-3-ol 2.24 2.13486
20 Octan-2-ol 2.9 2.9271
21 Octan-4-ol 2.68 2.76001
22 2-Methylpropan-2-ol 0.35 0.11286
23 3-Methylbutan-2-ol 1.28 1.84904
24 2,3-Dimethylbutan-2-ol 1.48 1.34263
25 Ethyl-n-butyl ether 2.03 2.02127
26 Decan-2-one 3.73 3.04143
27 5-Methylhexan-2-one 1.88 1.25248
28 5-Methyloctan-2-one 2.92 2.40016
29 Heptanoic acid 2.41 2.36425
30 Octanoic acid 3.05 2.91318
31 Decanoic acid 4.09 4.07771
32 Tetradecanoic acid 6.1 6.44255
33 Hexadecanoic acid 7.17 7.62204
34 Octadecanoic acid 8.23 8.80607
35 Eicosanoic acid 9.29 10.01429
36 3-Methylbutanoic acid 1.16 1.37854
37 2-Ethylbutanoic acid 1.68 1.92453
38 2-Methylpentanoic acid 1.8 1.91867
39 2-Propylpentanoic acid 2.75 3.20339
40 2-Ethylhexanoic acid 2.64 3.22025
41 n-Pentyl propanoate 2.67 2.66774
42 n-Butyl pentanoate 3.36 3.3828
43 Ethyl isobutanoate 1.55 1.32709

44 1-Cyanopropane 0.53 0.87102
45 1-Cyanobutane 1.12 1.46055
46 1-Cyanopentane 1.66 2.01255
47 2-Cyanopropane 0.46 0.89997
48 2-Cyanobutane 1.1 1.3073
49 1-Cyano-2-methylpropane 1.1 1.48722
50 2-Cyano-2-methylpropane 1.08 1.43805
51 n-Heptylamine 2.57 3.01944
52 n-Octylamine 3.09 3.61674
53 n-Nonylamine 3.6 4.22576
54 Isopropylamine 0.26 0.97179
55 Isobutylamine 0.73 0.98132
56 sec-Butylamine 0.74 0.86406
57 tert-Butylamine 0.4 1.38293
58 Diisopropylamine 1.16 1.50972
59 Butanamide -0.21 0.08575
60 Ethylthiol 1.18 1.98829
61 n-Propylthiol 1.81 1.4209
62 n-Butylthiol 2.28 2.0519
63 1-Fluorobutane 2 1.82412
64 1-Fluoropentane 2.33 2.38038
65 1-Chlorohexane 3.66 3.7241
66 1-Chloroheptane 4.15 4.29098
67 1-Chlorooctane 4.73 4.82671
68 1-Bromohexane 3.8 3.60502
69 1-Bromoheptane 4.36 4.24012
70 1-Bromooctane 4.89 4.79637
71 1-Bromodecane 6 5.89041
72 Isopropylbenzene 3.66 3.73341
73 1,2,3-Trimethylbenzene 3.66 3.30057
74 1,3,5-Trimethylbenzene 3.59 3.58295
75 2-Ethyltoluene 3.53 3.44326
76 4-Ethyltoluene 3.63 3.74418
77 4-Isopropyltoluene 4.1 4.35056
78 1,2,3,4-Tetramethylbenzene 3.98 3.75078
79 1,2,3,5-Tetramethylbenzene 4.04 3.93979
80 1,2,4,5-Tetramethylbenzene 4 3.96346
81 n-Pentylbenzene 4.9 5.03192
82 Pentamethylbenzene 4.56 4.25434
83 n-Hexylbenzene 5.52 5.6416

84 n-Octylbenzene 6.34 6.86044
85 n-Decylbenzene 7.4 8.08068
86 2,4-Dimethylphenol 2.3 1.80214
87 2,5-Dimethylphenol 2.33 1.84465
88 2,6-Dimethylphenol 2.36 1.85564
89 3,4-Dimethylphenol 2.23 2.06304
90 3,5-Dimethylphenol 2.35 1.70467
91 2-Ethylphenol 2.47 1.95311
92 3-Ethylphenol 2.4 2.42215
93 4-Ethylphenol 2.58 1.96484
94 2,4,6-Trimethylphenol 2.97 2.46979
95 2,3,6-Trimethylphenol 2.67 2.25579
96 1,2,3-Trichlorobenzene 4.05 3.9296
97 1,2,4-Trichlorobenzene 4.02 3.40647
98 1,3,5-Trichlorobenzene 4.19 3.05873
99 1,2,3,4-Tetrachlorobenzene 4.64 3.94045
100 1,2,3,5-Tetrachlorobenzene 4.65 3.54198
101 1,2,4,5-Tetrachlorobenzene 4.6 2.91003
102 Pentachlorobenzene 5.18 3.48343
103 Hexachlorobenzene 5.73 3.69628
104 1,2-Dibromobenzene 3.64 3.55591
105 1,3-Dibromobenzene 3.75 3.38148
106 Dimethyl ether 0.1 -0.59729
107 Methyl nonanoate 3.87 3.86972
108 Methyl decanoate 4.41 4.48803
109 Methylthiol 0.65 1.21071
110 Cyclooctane 4.45 4.34961
111 Cyclooctanol 2.39 2.48884
112 Cyclohexanone 0.81 0.15757
113 Cyclododecanone 4.1 3.07294
114 Formic acid -0.54 -0.13045
115 Formamide -1.51 -0.78198
116 N,N-Dimethylacetamide -0.77 -0.2653
117 N,N-Diethylacetamide 0.34 1.61291
118 N,N-Dimethylpropanamide -0.11 0.57384
119 Methyl acrylate 0.8 0.10993
120 Ethyl acrylate 1.32 0.74196
121 n-Butyl acrylate 2.36 2.1043
122 Isobutyl acrylate 2.22 2.18408
123 Methyl methacrylate 1.38 0.99818
124 Ethyl methacrylate 1.94 1.34087
125 n-Butyl methacrylate 2.88 2.64553

126 Isobutyl methacrylate 2.66 2.73953
127 1,4-Dichlorobutane 2.24 1.92607
128 Diethoxymethane 0.84 1.93911
129 Cyclohexanediol 0.16 0.0601
130 1,2-Ethanediol -1.36 -0.19861
131 2,3-Butanediol -0.92 -0.52767
132 Ethoxyethanol -0.32 0.50715
133 Butoxyethanol 0.83 1.88129
134 1,4-Butanediol -0.83 -0.64346
135 1-Methylcyclohexanol 1.33 1.56176
136 Tetrahydropyran 0.95 1.05607
137 4-Methylcyclohexanone 1.38 0.81276
138 5-Phenylpentan-2-one 2.42 1.57421
139 Glutaric acid -0.29 -0.58483
140 Cyclohexanecarboxylic acid 1.96 1.78748
141 Tripropylamine 2.79 3.76331
142 Cyclohexylamine 1.49 1.96117
143 Methyl 4-phenylbutanoate 2.77 2.38038
144 Dimethyl adipate 1.03 0.72481
145 4-Chlorobutyronitrile 0.56 0.71822
146 Octanenitrile 2.73 3.14946
147 4-Phenylbutyramide 1.41 2.57019
148 4-Chlorobutanol 0.85 0.6867
149 Cyclopropylamine 0.07 0.18542
150 4-tert-Butylcyclohexanol 3.06 3.57936
151 Bromochloromethane 1.41 0.63313
152 Difluoromethane 0.2 0.13859
153 Thymol 3.3 3.02897
154 Bibenzyl 4.79 5.03558
155 Crotonic acid 0.72 0.25138
156 Acrylamide -0.67 -0.71089
157 1,2-Dibutoxyethane 2.48 4.78567

Table A7. LogPow (SIM) for large data set calculated with AM1 and the iso
No Compound Exp Calc
1 Ethane 1.81 1.454
2 Cyclopropane 1.72 1.213
3 Butane 2.89 2.374
4 1-Pentyne 1.98 2.112
5 1,5-Hexadiene 2.87 2.768
6 Allylbenzene 3.23 3.43
7 1,3,5-Trimethylbenzene 3.42 3.35
8 1-Methylnaphthalene 3.87 3.694
9 Acenaphthene 3.92 3.651
10 2,6-Dimethylnaphthalene 4.31 4.079
11 1,4-Dimethylnaphthalene 4.37 4.138
12 1,7-Dimethylnaphthalene 4.44 4.19
13 1,8-Dimethylnaphthalene 4.26 3.985
14 Hexamethylbenzene 4.61 4.053
15 Fluorene 4.18 3.869
16 1-Methyl-9H-fluorene 4.97 4.15
17 Bibenzyl 4.79 4.845
18 Pyrene 4.88 4.707
19 Fluoranthene 5.16 4.759
20 Benz(A)anthracene 5.79 5.388
21 Ethyleneoxide -0.3 0.4616
22 Ethanol -0.31 0.1637
23 Furan 1.34 1.122
24 Butanol 0.88 0.8264
25 3-Pentanol 1.21 1.234
26 Cyclohexanol 1.23 1.269
27 3,3-Dimethyl-2-butanol 1.48 1.65
28 m-Methylphenol 1.96 1.896
29 2,6-Dimethylphenol 2.36 1.945
30 2,6-Dimethylcyclohexanol 2.38 1.851
31 Octanol 3 2.739
32 3-Phenyl-propanol 1.88 2.358
33 TR-2-Phenylcyclopropylcarbinol 1.95 1.819
34 p-tert-Butylphenol 3.31 3.151
35 Adamantane 2.14 2.375
36 1-Dodecanol 5.13 4.493
37 Phenyl-p-tolylcarbinol 3.13 3.809
38 Acetone -0.24 0.2081
39 2-Butanone 0.29 0.7136
40 1-Hexen-5-one 1.02 1.238
41 4-Cyclopropyl-2-butanone 1.5 1.265
42 Phenylacetaldehyde 1.78 1.904
43 1-Phenylpropan-2-one 1.44 2.018
44 2-Nonanone 3.14 2.813
45 n-Propylformate 0.83 0.4846

46 Ethylacetate 0.73 0.2221
47 Ethylpropionate 1.21 0.4737
48 Phenylacetate 1.49 1.604
49 2-Propylpentanoic acid 2.75 1.452
50 p-Tolylacetate 2.11 2.14
51 Ethylbenzoate 2.64 2.113
52 beta-Phenylpropionic acid 1.84 1.565
53 Methyl-3-phenylpropionate 2.32 2.519
54 Decanoic acid 4.09 2.574
55 (2-Propan-2-ylphenyl)acetate 2.78 2.858
56 1-Acetoxynaphthalene 2.78 2.684
57 Propionitrile 0.16 0.9268
58 Isopropylamine 0.26 -1.58E-02
59 Methylbutylamine 1.33 1.555
60 Amylamine 1.49 1.063
61 Triethylamine 1.45 1.965
62 Dipropylamine 1.67 2.095
63 Hexylamine 2.06 1.595
64 m-Toluidine 1.4 1.297
65 Propyl-isobutylamine 2.07 2.36
66 N,N-Dimethylaniline 2.31 2.51
67 N-Ethylaniline 2.16 2.257
68 N-Methyl-p-toluidine 2.15 1.876
69 N-Propylaniline 2.45 2.462
70 N,N,4-Trimethylaniline 2.81 2.592
71 N,N-Dimethyl-1-phenylmethanamine 1.98 2.642
72 Tripropylamine 2.79 3.582
73 1-Phenylbutan-2-amine 2.28 2.348
74 Diphenylmethylamine 3.9 4.029
75 N-Methylacetamide -1.05 -0.8345
76 Benzamide 0.64 0.8224
77 Acetanilide 1.16 1.428
78 N-Methylformanilide 1.09 1.583
79 Phenylacetamide 0.45 0.5849
80 Allyl-isopropyl-acetamide 1.14 1.032
81 2-Propan-2-ylpentanamide 1.48 1.421
82 Cinnamamide 1.43 1.548
83 4-propan-2-ylbenzamide 2.14 1.869
84 Isobutyranilide 1.95 1.812
85 4-Phenylbutanamide 1.41 1.733
86 [(E)-2-Nitrovinyl]benzene 2.11 1.972
87 Carbon tetrachloride 2.83 3.581
88 Difluoromethane 0.2 -0.3114
89 Methylenechloride 1.25 1.579
90 Methylchloride 0.91 0.6398
91 Tetrachloroethylene 3.4 3.271
92 Trichloroethylene 2.61 2.348

93 1,1-Difluoroethylene 1.24 0.7709
94 Ethylchloride 1.43 1.536
95 Ethylbromide 1.61 2.435
96 1-Chloropropane 2.04 2.368
97 2-Chloropropane 1.9 1.873
98 1-Bromopropane 2.1 3.096
99 1-Bromopentane 3.37 3.836
100 1,3-Dibromobenzene 3.75 2.707
101 Iodobenzene 3.25 2.606
102 alpha-Chlorotoluene 2.3 2.322
103 1-Methylpentachlorocyclohexane 4.04 4.254
104 1,4-Dimethyltetrachlorocyclohexane 4.4 4.794
105 3-Chlorobiphenyl 4.58 4.284
106 Adenosine -1.05 -1.053
107 2,6-Diaminopurine 2'-Deoxyriboside -0.52 -1.962
108 deoxyuridine -1.62 -0.7789
109 3'-Fluoro-2',3'-dideoxyuridine -0.49 1.094
110 3'-Deoxy-3'-fluorothymidine -0.28 0.3543
111 2-amino-6-bromo-2',3'-dideoxypurine 0.34 0.5567
112 8-Azaadenine -0.96 1.11
113 9-Butylpurin-6-amine 1.25 0.6148
114 9-Pentylpurin-6-amine 1.79 1.009
115 Cytosine -1.73 -0.1575
116 Hypoxanthine -1.11 -0.1446
117 4-Nitropyrazole -0.59 6.35E-02
118 5-Chlorouracil -0.35 0.1955
119 1-Methyluracil -1.2 -0.3744
120 Tribromoethene 3.2 3.2
121 2,2,2-Trichloroethanol 1.42 2.502
122 Bromoacetic acid 0.41 0.5277
123 Fluoroacetamide -1.05 -1.239
124 Hydroxyacetic acid -1.11 -1.206
125 2-Fluoroethanol -0.76 -0.2858
126 2-Chloroethanol -0.06 0.7047
127 2-Bromoethanol 0.18 1.337
128 4-Nitropyrazole 0.59 1.102
129 Pyrazole 0.26 5.11E-02
130 Imidazoline-2-thiol -0.66 8.41E-02
131 alpha-Hydroxypropionic acid -0.72 -0.8975
132 Glycerol -1.76 -1.664
133 Barbituric acid -1.47 -0.7828
134 2-Methyl-4,5-dihydro-1H-imidazole 0.52 -0.6484
135 N-Nitrosothiomorpholine 0.4 0.8287
136 N-Nitrosomorpholine -0.44 0.2063
137 3,5,6-Trichloro-2-pyridinol 3.21 2.737
138 2,3-Dichloropyridine 2.11 2.3
139 3,5-Dichloropyridine 2.56 2.404

140 Uric acid -2.17 -0.5138
141 Pyridine 0.65 1.046
142 Acetylacetone 0.4 -0.1067
143 D-Valerolactone -0.35 0.1463
144 1,3-Diacetyl-urea -0.68 -2.305
145 N-Methylmorpholine -0.33 3.34E-02
146 2,4,6-Tribromophenol 4.13 3.898
147 3-Cyanopyridine 0.23 1.401
148 4-Cyanopyridine 0.46 1.519
149 m-Iodonitrobenzene 2.94 1.967
150 o-Dinitrobenzene 1.69 2.017
151 p-Dinitrobenzene 1.46 1.902
152 o-Fluorophenol 1.71 1.19
153 p-Fluorophenol 1.77 1.123
154 3,4-Dichlorobenzenesulfonamide 1.44 1.722
155 p-Fluoroaniline 1.15 0.576
156 o-Chloroaniline 1.9 1.733
157 o-Nitroaniline 1.85 1.038
158 p-Nitroaniline 1.39 1.017
159 2-Picoline 1.11 1.436
160 Phenylhydroxylamine 0.79 0.8007
161 p-Aminophenol 0.04 0.2716
162 Piconol 0.06 0.4649
163 Phenylsulfamide 0.4 0.736
164 Sulfanilamide -0.62 2.60E-02
165 3-Methyl-N-nitrosopiperidine 0.99 1.345
166 Diethylacetal 0.84 1.125
167 3,5-Diiodosalicylic acid 4.56 3.321
168 Phenylisothiocyanate 3.28 2.262
169 p-Bromobenzoic acid 2.86 2.044
170 m-Iodobenzoic acid 3.13 1.569
171 7-Azaindole 1.82 1.097
172 o-Phenyleneurea 1.12 0.9036
173 2-Chloro-4-aminobenzoic acid 1.33 0.8486
174 m-Nitrobenzamide 0.77 0.378
175 p-Nitrobenzamide 0.82 0.5428
176 Benzaldoxime 1.75 1.649
177 o-Hydroxybenzamide 1.28 3.74E-02
178 p-Hydroxybenzamide 0.33 -3.49E-03
179 p-Aminobenzoic acid 0.83 0.3551
180 3-Hydroxy-4-aminobenzoic acid 0.5 -0.1222
181 p-Nitrobenzyl alcohol 1.26 1.146
182 Benzoylhydrazine 0.19 -9.23E-02
183 4-Pyridineacetamide -0.65 -9.63E-02
184 Phenylurea 0.83 0.3346
185 Methylphenylsulfoxide 0.55 1.093
186 Salicyl alcohol 0.73 0.9729

187 2,4-Diaminobenzoic acid -0.31 -0.7212
188 Thiophene-2-carboxylic acid ethylester 2.33 2.032
189 3-Methylsulfanylaniline 1.45 0.5749
190 m-Methylbenzenesulfonamide 0.85 0.8004
191 4-Pyridineethaneamine -0.01 0.5096
192 1-H-2-Methoxytetrachlorocyclohexane 2.99 3.521
193 Sulfaguanidine -1.22 -1.08
194 1,3-Diallylurea 0.64 0.3586
195 1-Nitrosotriethylurea 1.54 1.037
196 4,5,7-Trichloro-2-(trifluoromethyl)benzimidazole 3.78 4.189
197 1,2,4-Benzothiadiazine-1,1-dioxide-3-trifluoromethyl-6-chloro 1.65 2.38
198 6-Nitro-2-(trifluoromethyl)-1H-benzimidazole 2.68 3.092
199 2-(Trifluoromethyl)-1H-benzimidazole 2.67 2.666
200 m-Cyanobenzoic acid 1.48 1.205
201 p-Cyanoformanilide 1.08 1.249
202 4-Chloro beta-nitrostyrene 2.44 2.362
203 6-Chloro-2-methylsulfanyl-1H-benzimidazole 3.22 2.59
204 2-Aminoquinazoline-4-one 0.6 0.7154
205 2-(Fluorophenyl)acetate 1.76 1.407
206 3-Fluorophenylacetate 1.74 1.346
207 3-Chlorophenylacetate 2.32 2.284
208 2-Bromophenylacetate 2.2 2.419
209 m-Iodophenylacetic acid 2.62 1.64
210 2-(2,4,6-Tribromophenoxy)ethanol 3.42 3.925
211 m-Nitroacetophenone 1.42 1.553
212 2-(2-Fluorophenoxy)acetic acid 1.39 1.009
213 2-(3-Chlorophenoxy)acetic acid 2.03 1.857
214 2-(2-Nitrophenoxy)acetic acid 0.97 0.8384
215 1-(3,3,3-Trifluoroethoxy)pentachlorocyclohexane 4.06 4.468
216 S-Phenylethanethioate 2.23 2.104
217 p-Fluoroacetanilide 1.47 0.4283
218 2-Hydroxyacetophenone 1.92 1.404
219 4-Methoxybenzoic acid 2.74 0.8878
220 Iso-phthalamide -0.21 -0.7384
221 N-Methyl-2-fluorophenylcarbamate 1.25 0.9564
222 N-Methyl-3-fluorophenylcarbamate 1.48 1.3
223 N-Methyl-4-fluorophenylcarbamate 1.28 1.201
224 2-Chlorophenylcarbamate,o-methyl 2.13 1.894
225 N-Methyl-3-bromophenylcarbamate 2.25 1.861
226 N-Methyl-2-iodophenylcarbamate 1.94 1.946
227 1,2,4-Benzothiadiazine-1,1-dioxide-3-methyl-6-amino-7-chloro 0.63 1.198

228 Methyl 3-hydroxybenzoate 1.89 1.216
229 m-Hydroxyphenylacetic acid 0.85 0.6542
230 o-Hydroxyphenylacetic acid 0.85 0.5124
231 (2-Hydroxyphenoxy)acetic acid 0.85 0.6071
232 p-Methylsulfonylbenzoic acid 0.67 0.6285
233 p-Methoxyformanilide 1.03 1.195
234 o-Methoxybenzamide 0.84 1.015
235 p-Methoxybenzamide 0.86 0.7343
236 2-Pyridine propanamide -0.27 0.2265
237 2-Phenoxyethanol 1.16 2.016
238 2-Pyridinepropanol 0.58 1.108
239 4-Pyridinepropanol 0.58 1.318
240 p-Ethylbenzenesulfonamide 1.31 1.145
241 2-Chloroquinoline 2.71 2.781
242 8-Chloroquinoline 2.44 2.612
243 5-Nitroquinoline 1.86 2.029
244 7-Nitroquinoline 1.82 1.876
245 Phenoxyacetic acid,3-cyano-4-chloro 1.56 2.391
246 Quinoline 2.03 2.173
247 4-Hydroxyquinoline 0.75 1.729
248 8-Quinolinol 2.02 1.462
249 p-Trifluoromethylphenylacetic acid 2.45 1.251
250 (4-Cyanophenoxy)-acetic acid 0.93 1.23
251 4-Aminoquinoline 1.63 1.465
252 3-Aminoquinoline 1.63 1.324
253 Acetylsalicylic acid 1.19 0.6613
254 p-Acetylformanilide 0.94 0.838
255 3-Acetamidobenzoic acid 1.32 0.3003
256 5,6-Dimethyl-1H-benzimidazole 2.35 1.603
257 p-N-Acetylaminobenzamide 0.01 3.88E-02
258 p-Methoxyphenylacetic acid 1.42 0.8824
259 o-Methylphenoxyacetic acid 1.98 1.432
260 p-Methylphenoxyacetic acid 1.86 1.539
261 2-(3-Methoxyphenoxy)acetic acid 1.38 1.646
262 Phenoxyacetic acid,m-methylsulfonyl 0.01 0.7186
263 3-(4-Chlorophenyl)-1,1-dimethylurea 1.94 1.805
264 Ethyl 2-aminobenzoate 2.57 1.284
265 N-methyl-o-tolycarbamate 1.46 1.608
266 N-phenyl ethylcarbamate 2.3 1.816
267 1,1-Dimethyl-3-p-nitrophenylurea 1.51 1.677
268 1-Phenyl-3-ethylthiourea 1.42 3.381
269 1,3-Dimethyl-1-phenylurea 1.02 0.9155
270 2-Pyridinebutanol 0.86 1.59
271 4-Pyridinebutanamine 0.86 1.343
272 5-Ethyl-5-isopropylbarbituric acid 0.97 0.4037
273 N-Methyl-5-butylbarbituric acid 1.1 1.055
274 3-Isopropylthio-4-amino-6-isopropyl-1,2,4-triazine-5-one 2.06 1.747
275 1,3-Dibutylurea 1.4 -0.798
276 (3,4,5-Trichlorophenyl)hydrazono]cyano-acetic methyl ester 5.22 4.301
277 Butanenitrile,2[(3,4-dichlorophenyl)hydrazono]-3-oxo 4.56 3.482
278 Methyl(2Z)-2-[(3-chlorophenyl)hydrazinylidene]-2-cyano acetate 3.56 3.03
279 5-Methyl-8-quinolinol 2.37 1.708
280 4-Methyl-5,8-dihydroxyquinoline 1.59 1.215
281 2-N,N-Dime-6(5-NO2-2-furyl)-1,3-thiazine-4-one 0.55 2.568
282 3-Methio-4-amine-6-phenyl-1,2,4-triazine-5-one 1.66 1.259
283 Benzoylacetone 2.52 1.905
284 1,2,4-Benzothiadiazine-1,1-dioxide-3-methyl-6-acetylamino-7-chloro 0.53 0.8317
285 Phenoxyacetic acid,3-acetamido-4-chloro 0.75 0.8014
286 o-ethyl phenoxyacetic acid 2.53 1.966
287 5,5-Diallylbarbituric acid 1.05 1.012
288 p-Aminohippuric acid,methyl ester -0.23 0.376
289 N-(4-Ethoxyphenyl)acetamide 1.58 1.708
290 Fuscaric acid 0.68 1.797
291 3,5-Dimethyl-4-hydroxyacetanilide 1.11 1.167
292 3-Ethyl-4-hydroxyacetanilide 1.31 1.495
293 3,5-Dimethylphenyl methylcarbamate 2.23 1.998
294 N-Methyl-3-ethoxyphenylcarbamate 1.75 1.726
295 N-Methyl-4-ethoxyphenylcarbamate 1.63 1.695
296 4-Sulfamylbenzoic acid,propyl ester 1.75 1.035
297 2-(3'-Pyridyl) piperidine 0.97 1.481
298 Nikethamide 0.33 1.414
299 4-Pyridinepentanol 1.39 2.116
300 5-tert-Butyl-5-ethyl-1,3-diazinane-2,4,6-trione 1.73 0.874
301 1-Isothiocyanonaphthalene 4.34 3.466
302 (2-Chloro-4-trifluoromethylphenyl)hydrazono]cyano-acetic acid methyl 4.66 3.677
303 (2,4,5-Trichlorophenyl)hydrazono]-cyano-acetic acid ethyl ester 5.21 4.49
304 (3-Chloro-phenyl)hydrazono] cyano-acetic acid ethyl ester 3.94 3.414
305 (2-Chlorophenyl)hydrazono] cyano-acetic acid ethyl ester 3.38 2.813
306 8-Dimethylaminoquinoline 2.73 2.178
307 3-Acetyl phenyl dimethylcarbamate 1.18 1.774
308 Sulfisoxazole 1.01 1.26

309 2,3,5-Trimethyl-4-hydroxyacetanilide 0.82 0.5573
310 (2-Propan-2-yloxyphenyl) N-methylcarbamate 1.52 1.831
311 3-Isopropoxyphenyl N-methylcarbamate 1.96 2.109
312 1,7-Phenanthroline 2.51 2.426
313 Phenazine 2.84 3.035
314 1-Naphthyl N-methylcarbamate 2.36 2.621
315 o-Phenoxyaniline 2.46 3.014
316 4,4'-Diamino diphenyl ether 1.36 1.499
317 5,6-Dihydro-2-methyl-1,4-oxathiin-3-carboxamide 2.14 2.314
318 2-Iodophenyl-beta-D-glucopyranoside 0.27 1.079
319 (o-sec-Butylphenoxy)acetic acid 3.32 2.752
320 2-sec-Butylphenyl N-methylcarbamate 2.78 2.477
321 N-Methyl-3-methyl-6-isopropyl phenylcarbamate 2.84 2.515
322 N-Methyl-4-sec-butylphenylcarbamate 3.2 2.866
323 N-Methyl-4-tert-butylphenylcarbamate 3.06 2.799
324 N-Methyl-4-butoxyphenylcarbamate 2.86 2.454
325 2-I-Pentoxy-4-aminobenzoic acid 2.3 1.852
326 2-Aminophenyl beta-d-glucopyranoside -1.23 -1.047
327 Thioxanthone 3.99 2.616
328 (4-Isothiocyanatophenyl)-phenyldiazene 5.55 4.135
329 4-Isothiocyanaodiphenylether 4.75 4.142
330 4-Isothiocyanodiphenylsulfoxide 4.4 2.757
331 1-Aminoacridine 2.47 2.696
332 p-Hydroxybenzophenone 3.07 2.975
333 m-Phenoxybenzoic acid 3.91 2.96
334 2-(3-Methylphenyl) acetic acid 1.95 1.362
335 3-Chlorophenyl 4-aminosalicylate 3.9 2.888
336 4-Bromophenyl 4-aminosalicylate 3.46 3.103
337 4,4-Dimethyl-3-oxo-2-[(2,4,5-trichlorophenyl) hydrazono]pentanenitrile 5.86 5.105
338 Ethyl 3-oxo-5-phenylpentanoate 2.52 2.269
339 Aminopyrine 1 1.971
340 3,4-Diethoxyphenylcarbamate,o-ethyl 2.5 2.483
341 Probenecid 3.21 2.097
342 4,4'-Diisothiocyanatebiphenyl 5.5 4.352
343 4-Isothiocyanobenzophenone 4.88 3.949
344 (4-Methylphenyl) 4-amino-2-hydroxybenzoate 3.38 2.533
345 5-Benzyl-3-furanmethanol acetate 3.24 3.222
346 3,5-Di-propyl-4-hydroxyacetanilide 2.67 2.765
347 Diethofencarb 2.82 2.529
348 2-Isothiocyanoanthracene 5.7 4.66
349 Apigenin 1.74 2.559
350 4-Aminosalicylic acid,2,6-dimethylphenyl ester 3.88 2.86
351 Cycloheximide 0.55 6.70E-02
352 Metoprolol 1.88 2.566
353 Medazepam 4.41 4.641
354 Tetrazepam 2.76 3.578
355 Propranolol 2.98 3.657
356 2,6-Diphenylpyridine 4.82 4.843
357 Chlorpromazine 5.19 5.059
358 Diphenhydramine 3.27 4.806
359 Mepyramine 3.27 3.284
360 Estrone 3.13 3.325
361 Acebutolol 1.71 1.376
362 Isothazine 4.77 5.972
363 Benzo(A)pyrene 5.97 5.712
364 o,o-diethyl-o-phenylphosphate 1.64 1.907
365 [Methoxy(methyl)phosphoryl]oxymethane -0.66 -0.1839
366 o,S-Dime-N-BU-phosphoramidothioate 0.94 0.2227
367 Dicapthon 3.58 2.841
368 2-Phenoxy-1,3,2-dioxaphospholane 1.42 1.202

Table A8. The theoretical logPcw calculated with AM1 and the iso
No Compound Exp Calc
1 p-Toluic acid 1.4 1.48
2 1-Naphthol 1.2 1.96
3 iso-Butyl alcohol 0.3 1.03
4 m-Chlorophenol 1.0 1.76
5 Propionamide -1.4 -1.55
6 p-Hexylpyridine 5.0 5.34
7 p-Chlorophenol 1.0 0.57

Table A9. The total data set of 144 PPL compounds used (training and test set)
No Compound Inductance
1 17 alpha-Ethinylestradiol Negative
2 1-Amino-4-octylpiperazine Positive
3 1-Chloro-10,11-dehydroamitriptyline Positive
4 1-Chloroamitriptyline Positive
5 6-Hydroxy-dopamine Positive
6 Abacavir Negative
7 Amineptine Negative
8 Amodiaquine Positive
9 Azaserine Negative
10 Azithromycin Positive
11 Bilirubin Positive
12 Brompheniramine Positive
13 Caffeine Negative
14 Carbamazepine Negative
15 Carbon Tetrachloride Negative
16 Ceftazidime Negative
17 Chloroquine Positive
18 Chlorpromazine Positive
19 Chlortetracycline Negative
20 Ciprofibrate Negative
21 Clociguanil Positive
22 Clofibrate Negative
23 Colchicine Negative
24 Cyclizine Positive
25 Cyproterone Acetate Negative
26 Desipramine Positive
27 (d-)H-4,4-bis-Diethylaminoethoxy-diethylphenylethane Positive
28 Dibucaine Positive
29 Erythromycin Positive
30 Etoposide Negative
31 Felbamate Negative
32 Fenofibrate Negative
33 Fluoxetine Positive
34 Galactosamine Negative
35 Gemfibrozil Negative
36 Gentamicin Positive
37 Hydroxyzine Positive
38 Hypoglycine A Negative
39 Iprindole Positive
40 Ketoconazole Positive
41 Lysergide Positive
42 Methotrexate Negative

43 Methyldopa Negative
44 Norchlorcyclizine Positive
45 Nortriptyline Positive
46 Phenacetin Positive
47 Pheniramine Positive
48 Phenobarbital Negative
49 Phentermine Positive
50 Physostigmine Negative
51 Piroxicam Negative
52 Quinine Positive
53 R-800 Positive
54 RMI10393 Positive
55 SC-45864 Positive
56 Stilbamidine Positive
57 Sulindac Negative
58 Suramin Positive
59 Tamoxifen Positive
60 Temozolomide Negative
61 Tetracaine Positive
62 Thioacetamide Negative
63 Tilorone Positive
64 Tobramycin Positive
65 Tocainide Positive
66 Trimeprazine Positive
67 Trimethoprim sulfamethoxazole Positive
68 Trospectomycin sulfate Positive
69 Valproic Acid Negative
70 WY-14643 Negative
71 Zidovudine Negative
72 Zileuton Negative
73 3-Methylcholanthrene Negative
74 AC 3579 Positive
75 Acetaminophen Negative
76 Amikacin Positive
77 Amiodarone Positive
78 Amitriptyline Positive
79 Anticoman Negative
80 Aricept Negative
81 AY-25329 Negative
82 AY-9944 Positive
83 Bicalutamide Negative
84 Boxidine Positive
85 Bupropion Negative
86 Cephaloridine Positive
87 Chlorcyclizine Positive
88 Chloroform Negative


89   Chlorphentermine                     Positive
90   Citalopram                           Positive
91   Clindamycin                          Positive
92   Clomipramine                         Positive
93   Clozapine                            Positive
94   Cocaine                              Positive
95   Coralgil                             Positive
96   Dantrolene                           Negative
97   Demeclocycline                       Negative
98   Desferal                             Negative
99   Dibekacin                            Positive
100  Diclofenac                           Negative
101  Diflunisal                           Negative
102  di-Isobutamide                       Positive
103  Doxapram                             Negative
104  Doxycycline                          Negative
105  Emetine                              Positive
106  Ethyl fluclozepate                   Positive
107  Famotidine                           Negative
108  Fenfluramine                         Positive
109  Flutamide                            Negative
110  Homochlorocyclizine                  Positive
111  Hydrazine                            Negative
112  Hydroxyurea                          Negative
113  IA3                                  Positive
114  Imipramine                           Positive
115  Indoramin                            Positive
116  L-ethionine                          Negative
117  Maprotiline                          Positive
118  Meclizine                            Positive
119  Mesoridazine                         Positive
120  Metformin                            Negative
121  Methadone                            Negative
122  Methapyrilene                        Negative
123  Mianserin                            Positive
124  Netilmicin                           Positive
125  Noxiptiline                          Positive
126  Paraquat                             Positive
127  Perhexiline                          Positive
128  Procaine                             Negative
129  Promethazine                         Positive
130  Propranolol                          Positive
131  Quinacrine                           Positive
132  Quinidine                            Positive
133  Rolitetracycline                     Negative
134  SDZ_200-125                          Positive
135  SKF-14336-D                          Positive


136  Stavudine                            Negative
137  Tacrine                              Negative
138  Trifluperazine                       Positive
139  Triparanol                           Positive
140  Tunicamycin                          Positive
141  Zimelidine                           Positive
142  Ceftazidime                          Negative
143  Carbon tetrachloride                 Negative
144  Valproic acid                        Negative


Table A10. The 124 Descriptors calculated with ParaSurf10alpha

Molecular Electrostatic Potential Descriptors

Descriptor                  Symbol          Description
$V_{max}$                   MEPmax          Maximum (most positive) MEP value
$V_{min}$                   MEPmin          Minimum (most negative) MEP value
$\bar{V}_{+}$               meanMEP+        Mean of the positive MEP values
$\bar{V}_{-}$               meanMEP-        Mean of the negative MEP values
$\bar{V}$                   meanMEP         Mean of all MEP values
$\Delta V$                  MEP-range       MEP-range
$\sigma^{2}_{+}$            MEPvar+         Total variance in the positive MEP values
$\sigma^{2}_{-}$            MEPvar-         Total variance in the negative MEP values
$\sigma^{2}_{tot}$          MEPvartot       Total variance in the MEP
$\nu$                       MEPbalance      MEP balance parameter
$\sigma^{2}_{tot}\nu$       Var*balance     Product of the total variance in the MEP and the MEP balance parameter
$\gamma_{1}^{V}$            MEPskew         Skewness of the MEP-distribution
$\gamma_{2}^{V}$            MEPkurt         Kurtosis of the MEP-distribution
$\int V$                    MEPint          Integrated MEP over the surface

Local Ionization Energy Descriptors

Descriptor                  Symbol          Description
$IE_{L}^{max}$              IELmax          Maximum value of the local ionization energy
$IE_{L}^{min}$              IELmin          Minimum value of the local ionization energy
$\overline{IE}_{L}$         IELbar          Mean value of the local ionization energy
$\Delta IE_{L}$             IEL-range       Range of the local ionization energy
$\sigma^{2}_{IE}$           IELvar          Variance of the local ionization energy
$\gamma_{1}^{IE_{L}}$       IELskew         Skewness of the local ionization energy distribution
$\gamma_{2}^{IE_{L}}$       IELkurt         Kurtosis of the local ionization energy distribution
$\int IE_{L}$               IELint          Integrated local ionization energy over the surface

Local Electron Affinity Descriptors


Descriptor                  Symbol          Description
$EA_{L}^{max}$              EALmax          Maximum of the local electron affinity
$EA_{L}^{min}$              EALmin          Minimum of the local electron affinity
$\overline{EA}_{L}^{+}$     EALbar+         Mean of the positive values of the local electron affinity
$\overline{EA}_{L}^{-}$     EALbar-         Mean of the negative values of the local electron affinity
$\overline{EA}_{L}$         EALbar          Mean value of the local electron affinity
$\Delta EA_{L}$             EAL-range       Range of the local electron affinity
$\sigma^{2}_{EA+}$          EALvar+         Variance in the local electron affinity for all positive values
$\sigma^{2}_{EA-}$          EALvar-         Variance in the local electron affinity for all negative values
$\sigma^{2}_{EAtot}$        EALvartot       Sum of the positive and negative variances in the local electron affinity
$\nu_{EA}$                  EALbalance      Local electron affinity balance parameter
$\delta A_{EA}^{+}$         EALfraction+    Fraction of the surface area with positive local electron affinity
$A_{EA}^{+}$                EALarea+        Surface area with positive local electron affinity
$\gamma_{1}^{EA_{L}}$       EALskew         Skewness of the local electron affinity distribution
$\gamma_{2}^{EA_{L}}$       EALkurt         Kurtosis of the local electron affinity distribution
$\int EA_{L}$               EALint          Integrated local electron affinity over the surface

Local Electronegativity

Descriptor                  Symbol          Description
$\chi_{L}^{max}$            ENEGmax         Maximum value of the local electronegativity
$\chi_{L}^{min}$            ENEGmin         Minimum value of the local electronegativity
$\bar{\chi}_{L}$            ENEGbar         Mean value of the local electronegativity
$\Delta\chi_{L}$            ENEGrange       Range of the local electronegativity
$\sigma^{2}_{\chi}$         ENEGvar         Variance in the local electronegativity
$\gamma_{1}^{\chi_{L}}$     ENEGskew        Skewness of the local electronegativity distribution


$\gamma_{2}^{\chi_{L}}$     ENEGkurt        Kurtosis of the local electronegativity distribution
$\int \chi_{L}$             ENEGint         Integrated local electronegativity over the surface

Local Hardness

Descriptor                  Symbol          Description
$\eta_{L}^{max}$            HARDmax         Maximum value of the local hardness
$\eta_{L}^{min}$            HARDmin         Minimum value of the local hardness
$\bar{\eta}_{L}$            HARDbar         Mean value of the local hardness
$\Delta\eta_{L}$            HARDrange       Range of the local hardness
$\sigma^{2}_{\eta}$         HARDvar         Variance in the local hardness
$\gamma_{1}^{\eta_{L}}$     HARDskew        Skewness of the local hardness distribution
$\gamma_{2}^{\eta_{L}}$     HARDkurt        Kurtosis of the local hardness distribution
$\int \eta_{L}$             HARDint         Integrated local hardness over the surface

Local Polarizability Descriptors

Descriptor                  Symbol          Description
$\alpha_{L}^{max}$          POLmax          Maximum value of the local polarizability
$\alpha_{L}^{min}$          POLmin          Minimum value of the local polarizability
$\bar{\alpha}_{L}$          POLbar          Mean value of the local polarizability
$\Delta\alpha_{L}$          POL-range       Range of the local polarizability
$\sigma^{2}_{\alpha}$       POLvar          Variance in the local polarizability
$\gamma_{1}^{\alpha_{L}}$   POLskew         Skewness of the local polarizability distribution
$\gamma_{2}^{\alpha_{L}}$   POLkurt         Kurtosis of the local polarizability distribution
$\int \alpha_{L}$           POLint          Integrated local polarizability over the surface

Additional Descriptors

Descriptor                  Symbol          Description
$\mu$                       dipole          Dipole moment
$\mu_{D}$                   dipden          Dipolar density
$\alpha$                    polarizability  Molecular electronic polarizability
MW                          MWt             Molecular weight


G                           globularity     Globularity
A                           totalarea       Molecular surface area
VOL                         volume          Molecular Volume
                            Qsum            Sum of the VESPA electronic potential on (N, O, P, S, hal, F, Cl, Br, I, H) atoms
                            Estate          Analogous to the Kier & Hall Estate using the bond order between atom i and j instead of the distance for (N, O, P, S, F, Cl, Br, I, hal) atoms
                            Estate2         Analogous to the Kier & Hall Estate using $r_{ij}$ to describe the distance between atom i and j for (N, O, P, S, F, Cl, Br, I, hal) atoms
                            LocPol          Local polarity: All absolute deviations from the mean ESP for each surface point summed up and divided by the number of surface points
                            CovHBac         Covalent hydrogen bond acidity: $E_{HOMO}$(molecule) - $E_{LUMO}$(water)
                            CovHBbas        Covalent hydrogen bond basicity: $E_{LUMO}$(molecule) - $E_{HOMO}$(water)
                            EsHBac          Electrostatic hydrogen bond acidity: Most negative formal charge (molecule) - Most positive formal charge (water)
                            EsHBbas         Electrostatic hydrogen bond basicity: Most positive formal charge on hydrogen (molecule) - most negative formal charge (water)
                            CohIndex        Cohesive index: (Nacc × Ndon^0.5)/total surface
                            HoLu
                            LewDon          Lewis Donor
                            LewAcc          Lewis Acceptor
                            Nacc            Number of H-bond acceptors
                            Ndon            Number of H-bond donors
                            Naryl           Number of aromatic rings
                            Npos            Number of positive
                            Nneg            Number of negative


$F_{N}^{max}$               FNmax           Maximum value of the electrostatic field normal to the surface
$F_{N}^{min}$               FNmin           Minimum value of the electrostatic field normal to the surface
$\Delta F_{N}$              FNrange         Range of the field normal to the surface
$\bar{F}_{N}$               FNmean          Mean value of the field normal to the surface
$\sigma^{2}_{F}$            FNvartot        Variance in the field normal to the surface
$\sigma^{2}_{F+}$           FNvar+          Variance in the field normal to the surface for all positive values
$\sigma^{2}_{F-}$           FNvar-          Variance in the field normal to the surface for all negative values
$\nu_{F}$                   FNbal           Normal field balance parameter
$\gamma_{1}^{F_{N}}$        FNskew          Skewness of the field normal to the surface
$\gamma_{2}^{F_{N}}$        FNkurt          Kurtosis of the field normal to the surface
$\int F_{N}$                FNint           Integrated field normal to the surface over the surface
$\int F_{N}^{+}$            FN+             Integrated field normal to the surface over the surface for all positive values
$\int F_{N}^{-}$            FN-             Integrated field normal to the surface over the surface for all negative values
$\int |F_{N}|$              FNabs           Integrated absolute field normal to the surface over the surface
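Several descriptors in Table A10 are simple statistics of a local property sampled at the surface points. The following Python sketch illustrates how such values could be obtained from a plain list of surface values. It is not the ParaSurf implementation: the function names are hypothetical, the variances are assumed to be population variances about the mean of the respective subset, and the balance parameter is assumed to follow the usual Murray-Politzer form $\nu = \sigma^{2}_{+}\sigma^{2}_{-}/(\sigma^{2}_{tot})^{2}$ with $\sigma^{2}_{tot} = \sigma^{2}_{+} + \sigma^{2}_{-}$. Skewness, kurtosis and the surface integrals are omitted because their normalization is not given in the table.

```python
import math


def surface_statistics(values):
    """Statistical descriptors of a local property sampled at surface points."""
    n = len(values)
    mean = sum(values) / n
    pos = [v for v in values if v > 0.0]
    neg = [v for v in values if v < 0.0]

    def subset_mean(xs):
        return sum(xs) / len(xs) if xs else 0.0

    def subset_var(xs):
        # Assumption: population variance about the subset mean
        if not xs:
            return 0.0
        m = subset_mean(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    var_pos, var_neg = subset_var(pos), subset_var(neg)
    var_tot = var_pos + var_neg
    # Assumption: Murray-Politzer balance parameter
    balance = var_pos * var_neg / var_tot ** 2 if var_tot > 0.0 else 0.0
    return {
        "max": max(values),
        "min": min(values),
        "range": max(values) - min(values),
        "mean": mean,
        "mean_pos": subset_mean(pos),
        "mean_neg": subset_mean(neg),
        "var_pos": var_pos,
        "var_neg": var_neg,
        "var_tot": var_tot,
        "balance": balance,
        "var_x_balance": var_tot * balance,
        # LocPol: mean absolute deviation from the mean surface value
        "loc_pol": sum(abs(v - mean) for v in values) / n,
    }


def cohesive_index(n_acc, n_don, total_area):
    """CohIndex as written in Table A10: (Nacc x Ndon^0.5) / total surface."""
    return n_acc * math.sqrt(n_don) / total_area


example = surface_statistics([2.0, 4.0, -1.0, -3.0])
```

The four-point input is an invented toy example that keeps the arithmetic easy to follow. A real surface contains thousands of points and would normally be weighted by the area associated with each point.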


Table A11. Results of the best predictions obtained

Compound              Inductance  AM1       MNDO      AM1*      MNDO/d    PM3       PM6
                                  (solvex)  (solvex)  (solvex)  (solvex)  (solvex)  (iso)
                                  (NB)      (NB)      (RF)      (RF)      (RF)      (RF)
Abacavir              -1          -1        -1         1         1         1         1
Bilirubin              1          -1        -1        -1        -1        -1         1
Caffeine              -1          -1        -1        -1        -1        -1        -1
Carbamazepine         -1          -1        -1         1        -1        -1         1
Chloroquine            1           1         1         1         1         1         1
Chlorpromazine         1           1         1         1         1         1         1
Chlortetracycline     -1          -1        -1        -1        -1        -1         1
Clociguanil            1          -1        -1        -1        -1         1        -1
Colchicine            -1           1         1         1         1         1         1
Cyclizine              1           1         1         1         1         1         1
Desipramine            1           1         1         1         1         1         1
Dibucaine              1           1         1         1         1         1         1
Etoposide             -1           1         1        -1        -1        -1         1
Galactosamine         -1          -1        -1        -1        -1        -1        -1
Gemfibrozil           -1           1        -1        -1        -1        -1         1
Gentamicin             1           1         1         1         1         1         1
Hydroxyzine            1           1         1         1         1         1         1
Hypoglicin-A          -1          -1        -1        -1        -1        -1         1
Ketoconazole           1           1        -1        -1         1        -1         1
Lysergide              1           1         1         1         1         1         1
Methotrexate          -1          -1        -1        -1        -1        -1        -1
Methyldopa            -1          -1        -1        -1        -1        -1        -1
Norchlorcyclizine      1           1         1         1         1         1         1
Nortriptyline          1           1         1         1         1         1         1
Pheniramine            1           1         1         1         1         1         1
Phentermine            1           1         1         1         1         1         1
Piroxicam             -1          -1        -1        -1        -1        -1        -1
Sulindac              -1           1         1        -1         1        -1        -1
Suramin                1          -1         1        -1        -1        -1        -1
Tamoxifen              1           1         1         1         1         1         1
Temozolomide          -1          -1        -1        -1        -1        -1        -1
Tetracaine             1           1         1         1         1         1         1
Tobramycin             1           1         1         1         1         1         1
Tocainide              1           1        -1         1        -1         1        -1
WY-14643              -1          -1        -1        -1        -1        -1        -1
Zileuton              -1          -1        -1        -1        -1        -1        -1
AC-3579                1           1         1        -1         1         1         1
Acetaminophen         -1          -1        -1         1        -1        -1        -1
Amikacin               1           1         1        -1        -1         1         1
Amiodarone             1          -1         1        -1         1         1         1
AY-9944                1           1         1         1         1         1         1


Bicalutamide          -1           1         1        -1        -1        -1        -1
Carbon tetrachloride  -1          -1        -1        -1        -1        -1         1
Chlorcyclizine         1           1         1         1         1         1         1
Clomipramine           1           1         1         1         1         1         1
Demeclocycline        -1          -1        -1        -1        -1        -1         1
Diflunisal            -1           1         1        -1        -1        -1         1
Doxapram              -1           1         1        -1         1         1         1
Doxycycline           -1          -1        -1        -1        -1         1        -1
Emetine                1           1         1         1         1         1         1
Famotidine            -1          -1        -1        -1        -1        -1        -1
Fenfluramine           1           1         1         1         1         1         1
Flutamide             -1          -1         1        -1        -1        -1        -1
Homochlorcyclizine     1           1         1         1         1         1         1
Hydrazine             -1          -1        -1        -1         1        -1        -1
Methadone             -1           1         1         1         1         1         1
Netilmicin             1           1         1         1         1         1         1
Procaine              -1           1         1        -1        -1         1        -1
Promethazine           1           1         1         1         1         1         1
Quinacrine             1           1         1        -1        -1         1         1
Quinidine              1           1         1         1         1         1         1
Rolitetracycline      -1           1         1        -1        -1        -1         1
SDZ-200-125            1           1         1         1         1         1         1
Stavudine             -1          -1        -1        -1        -1        -1        -1
Tacrine               -1           1         1         1         1         1        -1
Trifluperazine         1           1         1         1         1         1         1
Triparanol             1           1         1         1         1         1         1
Tunicamycin            1           1         1        -1        -1        -1         1
Valproic_acid         -1          -1        -1        -1        -1        -1        -1
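Each model column of Table A11 can be compared with the observed Inductance column by counting agreements and disagreements. The Python sketch below shows one way to do this for labels coded as 1 (positive) and -1 (negative). The function name `classification_summary` is a hypothetical helper, and the example uses only the first five rows of the table (Abacavir to Chloroquine) for the AM1 (solvex, NB) column, so its output illustrates the calculation and does not reproduce the statistics reported in the thesis.

```python
def classification_summary(observed, predicted):
    """Confusion counts and rates for labels coded as 1 (positive) and -1 (negative)."""
    pairs = list(zip(observed, predicted))
    tp = sum(1 for o, p in pairs if o == 1 and p == 1)
    tn = sum(1 for o, p in pairs if o == -1 and p == -1)
    fp = sum(1 for o, p in pairs if o == -1 and p == 1)
    fn = sum(1 for o, p in pairs if o == 1 and p == -1)
    return {
        "tp": tp,
        "tn": tn,
        "fp": fp,
        "fn": fn,
        "accuracy": (tp + tn) / len(pairs),
        # Rates fall back to 0.0 when the corresponding class is absent
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
    }


# Abacavir, Bilirubin, Caffeine, Carbamazepine, Chloroquine (rows 1-5 of Table A11)
observed = [-1, 1, -1, -1, 1]
am1_nb = [-1, -1, -1, -1, 1]
summary_a11 = classification_summary(observed, am1_nb)
```

Passing the full Inductance column together with any of the six model columns gives the corresponding counts for the whole test set.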
