5
Theory: QSAR+ descriptors
A descriptor is a molecular property that QSAR+ can calculate. QSAR+ provides a wide variety of descriptors that you can use in determining new QSAR relationships. This chapter provides information about the following functional families of descriptors available in QSAR+:
 Fragment constants descriptors
Conformational descriptors
Electronic descriptors
Receptor descriptors
Quantum mechanical descriptors
Graph-theoretic descriptors
Topological descriptors (available only if you purchase the C^{2}·Descriptors+ module)
Information-content descriptors (available only if you purchase the C^{2}·Descriptors+ module)
Molecular shape analysis (MSA) descriptors
Spatial descriptors (some of which are available only if you purchase the C^{2}·Descriptors+ module)
Structural descriptors
Thermodynamic descriptors
pKa descriptors (ACD Labs)
ADME descriptors
In addition, this chapter provides information on the following 3D-QSAR descriptors:
 Molecular field analysis (MFA) descriptors
Receptor surface analysis (RSA) descriptors
For detailed information about how to use descriptors, see Chapter 8, Working with descriptors; Chapter 9, Working with fragment constants; Chapter 10, Performing molecular field analysis; and Chapter 11, Performing molecular shape analysis.
Additional information about descriptors (and other information) can be found in the combichem documentation.
In Startup and Configuration:

Daylight setup
Oracle setup
MDL ISIS setup
In 4d. Descriptors for library analysis:

Using molecular diversity descriptors
Statistical analyses and data-mining techniques
In Theory:
Graph-theoretic descriptors
Information-theoretic descriptors
Descriptors based on projections of the molecular surface (shadow indices)
Descriptors based on partial charges mapped on surface area
2D and 3D fingerprints metrics
In 4e. Data analysis and library visualization:

Visualization of compounds in descriptor space
Principal component analysis
Factor analysis
Cluster analysis
Multidimensional scaling
Fragment constants descriptors
Fragment constant descriptors are constants that relate the effect of substituents on a "reaction center" from one type of process to another. The basic idea is that similar changes in structure are likely to produce similar changes in reactivity, ionization, or binding. There are different constants corresponding to different effects. These are typically used to parameterize the Hammett (or Hammett-like) equation for some series of analogs. A comprehensive introduction is found in Hansch and Leo (1995). An example is:
$$\log \frac{k_x}{k_h} = \rho\,\sigma$$
where $k_x$ and $k_h$ are reaction rate constants for the substituents x and h, respectively; $\sigma$ is an electronic constant determined by an ionization constant; and $\rho$ is fit to the set of analogs being studied. Often, multiple terms corresponding to different properties (electronic, steric, etc.) at different R-group positions are used. In this way, measurements of ionization constants can be used to predict rate constants once a scaling factor ($\rho$) is determined. In this example, $\rho$ measures the importance of electronic effects for the rate constant.
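As a minimal sketch of how such a relationship is used, the snippet below assumes the single-term Hammett form log10(k_x/k_h) = rho * sigma and fits rho by least squares through the origin. The function names and all numerical values are hypothetical and chosen only for illustration; they are not taken from the QSAR+ fragment constants database.

```python
def fit_rho(sigmas, log_rate_ratios):
    """Least-squares slope through the origin for log10(k_x/k_h) = rho * sigma."""
    numerator = sum(s * y for s, y in zip(sigmas, log_rate_ratios))
    denominator = sum(s * s for s in sigmas)
    return numerator / denominator


def predict_rate_constant(k_h, rho, sigma):
    """Rate constant k_x predicted for a substituent with constant sigma."""
    return k_h * 10.0 ** (rho * sigma)


# Hypothetical training set: sigma values and measured log10(k_x/k_h).
sigmas = [-0.2, 0.0, 0.2, 0.4]
log_ratios = [-0.4, 0.0, 0.4, 0.8]

rho = fit_rho(sigmas, log_ratios)
k_new = predict_rate_constant(k_h=1.0, rho=rho, sigma=0.5)
```

Once rho is known for the series, any substituent with a tabulated sigma can be scored without a new rate measurement, which is the point of the fragment constant approach.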
The default database currently contains the following types of constants. These come from Table VI-1 of Hansch (1979), except for the sterimol constants, which are calculated.
Electronic effect sigma meta and sigma para, respectively. Positive values correspond to electron withdrawal, negative ones to electron release. Sigma is generally not appropriate for ortho substituents because of steric interaction with the reaction center.
Decompositions of the sigma para constant into an inductive (polar) part (F) and a resonance part (R) for the case when the substituent is conjugated with the reaction center, producing through-resonance effects.
Hydrophobic character. Pi for substituent X is given by the difference between its logP and the logP for hydrogen.
Hydrogen-bond acceptor.
Hydrogen-bond donor.
Molar refractivity as given by:
$$MR = \frac{n^2 - 1}{n^2 + 2} \cdot \frac{MW}{d}$$
where n is the refractive index, MW is the molecular weight, and d is the compound density.
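The following sketch evaluates molar refractivity from the three quantities named above, assuming the usual Lorentz-Lorenz form MR = (n^2 - 1)/(n^2 + 2) * MW/d. The function name is hypothetical, and the values for water are approximate textbook numbers used only as an illustration.

```python
def molar_refractivity(refractive_index, molecular_weight, density):
    """Molar refractivity in cm^3/mol for MW in g/mol and density in g/cm^3."""
    n2 = refractive_index ** 2
    return (n2 - 1.0) / (n2 + 2.0) * molecular_weight / density


# Approximate values for liquid water near room temperature.
mr_water = molar_refractivity(refractive_index=1.333,
                              molecular_weight=18.015,
                              density=0.997)
```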
Steric length parameter, measured along the substitution point bond axis.
Steric distances perpendicular to the bond axis. These define a bounding box for the substituent and are numbered in ascending size order.
The overall maximum steric distance perpendicular to the bond axis.
This table lists the conformational descriptors available in QSAR+:
This table lists the electronic descriptors available in QSAR+:
The sum of atomic polarizabilities (Apol) descriptor computes the sum of the atomic polarizabilities. The polarizabilities are calculated from the A coefficients used for molecular mechanics calculations:
For more information, see Marsali and Gasteiger (1980); Hopfinger (1973).
The dipole moment descriptor is a 3D electronic descriptor that indicates the strength and orientation behavior of a molecule in an electrostatic field. Both the magnitude and the components (X, Y, Z) of the dipole moment are calculated. The dipole moment is estimated from partial atomic charges and atomic coordinates. Partial atomic charges are computed using the charge setup option in the QSAR control panel, which offers CHARMm charging rules and the Gasteiger, CNDO/2, and Del Re methods. The descriptor is reported in debyes.
Dipole properties have been correlated with long-range ligand-receptor recognition and subsequent binding.
For more information, see Bottcher (1952); Del Re (1963); Gasteiger (1980); Hopfinger (1973); Marsali (1980).
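A point-charge estimate of the dipole moment can be sketched as follows. This is an illustration under stated assumptions, not the QSAR+ implementation: charges are given in units of the elementary charge, coordinates in angstroms, and the conversion factor 1 e·Å ≈ 4.8032 D is applied. The function name and the two-atom test system are hypothetical.

```python
import math

E_ANGSTROM_TO_DEBYE = 4.8032  # 1 e*Angstrom expressed in debyes (approximate)


def dipole_moment(charges, coords):
    """Return (mu_x, mu_y, mu_z, magnitude) in debyes from point charges."""
    components = [
        E_ANGSTROM_TO_DEBYE * sum(q * xyz[axis] for q, xyz in zip(charges, coords))
        for axis in range(3)
    ]
    magnitude = math.sqrt(sum(c * c for c in components))
    return components[0], components[1], components[2], magnitude


# Hypothetical diatomic: +0.5 e at the origin, -0.5 e at 1 Angstrom along X.
mu_x, mu_y, mu_z, mu = dipole_moment([0.5, -0.5],
                                     [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)])
```

For a neutral molecule the result does not depend on the choice of origin; for a charged species it does, so the coordinate frame should be fixed before comparing values.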
The HOMO descriptor adds the energy (in electronvolts) of the HOMO for each model, calculated by the CNDO/2 method, to the study table.
HOMO (highest occupied molecular orbital) is the highest energy level in the molecule that contains electrons. It is crucially important in governing molecular reactivity and properties. When a molecule acts as a Lewis base (an electron-pair donor) in bond formation, the electrons are supplied from the molecule's HOMO. How readily this occurs is reflected in the energy of the HOMO. Molecules with high HOMOs are more able to donate their electrons and are hence relatively reactive compared to molecules with low-lying HOMOs; thus the HOMO descriptor should measure the nucleophilicity of a molecule.
For more information, see Fischer (1969); Pople (1970; 1967; 1965; 1966); Sichel (1968); Wiberg (1968).
The LUMO descriptor adds the energy (in electronvolts) of the LUMO for each model, calculated by the CNDO/2 method, to the study table.
LUMO (lowest unoccupied molecular orbital) is the lowest energy level in the molecule that contains no electrons. It is important in governing molecular reactivity and properties.
When a molecule acts as a Lewis acid (an electron-pair acceptor) in bond formation, incoming electron pairs are received in its LUMO. Molecules with low-lying LUMOs are more able to accept electrons than those with high LUMOs; thus the LUMO descriptor should measure the electrophilicity of a molecule.
For more information, see Pople (1970; 1967; 1965; 1966); Fischer and Kollmar (1969); Sichel (1968); Wiberg (1968).
Superdelocalizability is an index of reactivity in aromatic hydrocarbons (AH), proposed by Fukui:
 S_{r} = superdelocalizability at position r
e_{j} = bonding energy coefficient in j^{th} MO (eigenvalue)
c = molecular orbital coefficient at position r in the HOMO
m = index of the HOMO
The index is based on the idea that early interaction of the molecular orbitals of two reactants may be regarded as a mutual perturbation, so that the relative energies of the two orbitals change together and maintain a similar degree of overlap as the reactants approach one another.
Therefore, considering the interaction of the MOs of the separated reactants gives us at least an estimate of the slope of the reaction coordinate. From this we make the additional assumptions that (1) this is a prediction of the height of the transitionstate barrier at position r, and (2) the greatest interaction will occur at the site of largest orbital density, that is, largest c^{2}.
The concept of delocalizability is introduced by the e_{j} term. For low-lying levels, this energy is large and positive. We may interpret this as meaning the electrons are tightly held, that is, not very delocalizable. For the upper occupied states (especially the HOMO), e_{j} is much smaller, that is, the electrons in the higher-energy orbitals are less tightly bound, which means they are relatively delocalizable. Therefore the upper energy levels will dominate the superdelocalizability term. Consequently, summing S for all atomic positions of a molecule gives a metric of electrophilicity, which may be used to predict relative reactivity in a series of molecules.
Quantitative values, such as the interaction energy calculated in Receptor for a generated receptor model, are available to use in QSAR+. By using Receptor data to develop a QSAR model, you can evaluate the goodness of fit between a candidate structure and a postulated pseudoreceptor.
When you have generated a receptor model and have aligned the models you want to study, you can proceed to build a QSAR using data from the receptor-structure iterations.
This table lists the receptor descriptors available in QSAR+:
The internal energy of a molecule as it sits in and is constrained by the receptor model you generated.
The interaction energy of the molecule with the receptor. It is the sum of the van der Waals and electrostatic interactions. The more negative the value, the greater the interaction between the molecule and the receptor.
The electrostatic interaction energy of the molecule with the receptor.
The van der Waals interaction energy of the molecule with the receptor.
The internal energy of a molecule as it sits in the receptor site without being subject to receptor model constraints. This value should always be less than or equal to IntraEnergy.
The difference in internal energy between the molecule minimized within the receptor model (IntraEnergy) and the molecule minimized without the receptor model (MinIntraEnergy).
Quantitative values calculated in the MOPAC application of the QUANTUM 1 card deck are available for use in QSAR+. These descriptors are the same as those of the same name described elsewhere in this chapter, except that the MOPAC descriptors are calculated using a semiempirical method that is likely to generate more accurate values. For information on MOPAC, see Cerius^{2} Quantum Mechanics - Chemistry.
The following table lists the MOPAC descriptors available in QSAR+:
All the graph-theoretic descriptors included here ultimately base their calculations on the representation of molecular structures as graphs, where atoms are represented by vertices and covalent chemical bonds by edges.
These descriptors fall into two categories:
All these descriptors perform their evaluations on hydrogen-suppressed graphs, that is, there are no vertices corresponding to hydrogens and no edges corresponding to bonds connecting hydrogens to other atoms.
Please refer to the Terms section for explanations of graph-theoretic terms and symbols used in the descriptor definitions below.
Topological indices are 2D descriptors based on graph theory concepts (Kier and Hall 1976, 1986; Katritzky and Gordeeva 1993). These indices have been widely used in QSPR and in QSAR studies. They help to differentiate the molecules according mostly to their size, degree of branching, flexibility, and overall shape.
The Wiener index is the sum of the numbers of chemical bonds separating all pairs of heavy atoms in the molecule. In graph-theoretical terms, it is the sum of the lengths of the minimal paths between all pairs of vertices representing heavy atoms. This is equal to half the sum of all D-matrix entries (Wiener 1947, Müller et al. 1987):
$$W = \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} a^{D}_{ij}$$
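The definition can be checked on small hydrogen-suppressed graphs with a breadth-first search. The sketch below is illustrative only; the adjacency-list representation and the function names are assumptions, not part of QSAR+.

```python
from collections import deque


def bfs_distances(adj, start):
    """Minimal path lengths (in edges) from `start` to every reachable vertex."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for nbr in adj[v]:
            if nbr not in dist:
                dist[nbr] = dist[v] + 1
                queue.append(nbr)
    return dist


def wiener_index(adj):
    """Half the sum of all distance-matrix entries."""
    total = sum(sum(bfs_distances(adj, v).values()) for v in adj)
    return total // 2


# Hydrogen-suppressed carbon skeletons.
n_butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
isobutane = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
```

The more compact, branched isobutane skeleton gives a smaller value than n-butane, which is how the index separates isomers by degree of branching.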
Zagreb index (Zagreb)
The Zagreb index is defined as the sum of the squares of vertex valencies (Bonchev 1983):
$$\text{Zagreb} = \sum_{i=1}^{N} \delta_i^{2}$$
where $\delta_i$ is the valence of vertex i.
Hosoya index (Z)
Let M be the number of edges in the graph. For any integer k, define p(k) to be the number of ways of choosing k nonadjacent edges from the graph. Note that p(k) is zero for k > [M/2], since there is no set of k nonadjacent edges in a graph of M edges if k > [M/2].
The Hosoya index is the sum of all (nonzero) p(k):
$$Z = \sum_{k=0}^{[M/2]} p(k)$$
with the convention that p(0) = 1 by definition.
It is a moderately easy exercise in graph theory to prove that the formula above can also be given in terms of the following recursion (implemented in C^{2}·Diversity). Let G be the graph for which the Hosoya index Z(G) is to be calculated. Remove an edge from G and denote the resulting graph by H. Again, remove the same edge from G, this time removing all the edges adjacent to it as well. Denote the resulting graph by K. Then the following is always true:
$$Z(G) = Z(H) + Z(K)$$
The recursion simplifies the given graph until one or both of H and K are empty graphs (graphs with no edges), in which case the index is defined as:
$$Z(\text{empty graph}) = 1$$
There exists a handy shortcut for graphs that consist of disjoint subgraphs (see an example calculation of Z(benzene) below): if G consists of disjoint subgraphs H and K, then:
$$Z(G) = Z(H) \cdot Z(K)$$
Example
The index displayed in the study table is the natural logarithm of Z, to handle the rapid growth of the index with molecule size (Hosoya 1971, Rouvray 1987).
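The edge-removal recursion described above can be sketched directly. This is a naive illustrative implementation (exponential in the number of edges, so suitable only for small graphs), assuming a graph given as a list of vertex pairs; the function name is hypothetical.

```python
import math


def hosoya_index(edges):
    """Z(G) = Z(H) + Z(K), where H drops one edge and K also drops its neighbors."""
    edges = list(edges)
    if not edges:
        return 1  # a graph without edges has only the empty selection, p(0) = 1
    (u, v), rest = edges[0], edges[1:]
    graph_h = rest                                              # edge removed
    graph_k = [e for e in rest if u not in e and v not in e]    # adjacent edges removed too
    return hosoya_index(graph_h) + hosoya_index(graph_k)


# Hydrogen-suppressed graphs as edge lists.
benzene = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
n_butane = [(0, 1), (1, 2), (2, 3)]

z_benzene = hosoya_index(benzene)
ln_z_benzene = math.log(z_benzene)  # the study table reports ln Z
```

For benzene the count breaks down as p(0) = 1, p(1) = 6, p(2) = 9, and p(3) = 2, which the recursion reproduces.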
This index, refined by Kier and Hall (1976), is a series of numbers designated by "order" and "subgraph type."
There are four subgraph types: Path, Cluster, Path/Cluster, and Chain. These types emphasize different aspects of atom connectivity within a molecule: the amount of branching, the ring structures present, and flexibility. Here we refer to these subgraph types as P, C, PC, and CH, respectively. They are defined as follows:
Definition
Given a connected subgraph G:
(i) If G contains a cycle it is of type CH (chain).
Otherwise:
(ii) If all vertex valencies of G (valencies with respect to G, not the entire graph) are either greater than 2 or equal to 1, G is of type C (cluster).
Otherwise:
(iii) If all vertex valencies (as in (ii)) are either equal to 2 or 1, G is of type P (path).
Otherwise:
(iv) G is of type PC (Path/Cluster). That means that valencies greater than 2, equal to 2, and equal to 1 are all present.
"Order" refers to the number of edges in a subgraph. The allowable orders are 0, 1, ..., M (where M is the number of edges in the entire graph).
Notes:
(a) Subgraphs of order 0 are assigned the class P (path). (b) Subgraphs of order 1 and 2 are necessarily of type P only. (c) Subgraphs of order 3 can be of type P, C, or CH only.
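The classification rules and notes above can be sketched as a small function. This is an illustration, not the QSAR+ code: it assumes the input is a connected subgraph given as a list of vertex pairs, applies notes (a) and (b) first for orders 0 to 2, and detects a cycle by comparing edge and vertex counts (a connected graph contains a cycle exactly when it has at least as many edges as vertices). The function name is hypothetical.

```python
from collections import Counter


def subgraph_type(edges):
    """Return 'P', 'C', 'PC', or 'CH' for a connected subgraph."""
    edges = list(edges)
    if len(edges) <= 2:
        return "P"  # notes (a) and (b): orders 0, 1, and 2 are paths
    valence = Counter(v for edge in edges for v in edge)  # valence within the subgraph
    if len(edges) >= len(valence):
        return "CH"  # rule (i): the subgraph contains a cycle
    if all(d > 2 or d == 1 for d in valence.values()):
        return "C"   # rule (ii): only branch points and terminal vertices
    if all(d <= 2 for d in valence.values()):
        return "P"   # rule (iii): only chain and terminal vertices
    return "PC"      # rule (iv): valencies 1, 2, and greater than 2 all occur


path3 = [(0, 1), (1, 2), (2, 3)]
star3 = [(0, 1), (0, 2), (0, 3)]
ring3 = [(0, 1), (1, 2), (2, 0)]
branched4 = [(0, 1), (1, 2), (1, 3), (3, 4)]
```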

The molecular connectivity index of order n corresponding to subgraph type s is denoted by ${}^{n}\chi_{s}$.
Given an order n and a subgraph type s, one considers all connected subgraphs of type s consisting of n edges. For each vertex $v_i$ in a subgraph, its valence $\delta_i$ (with respect to the entire graph) is calculated, and the partial index ${}^{n}P$ corresponding to the given subgraph is found according to:
$${}^{n}P = \prod_{i} \delta_i^{-1/2}$$
where the product runs over all vertices of the subgraph.
Finally, the partial indices are summed over all connected subgraphs of the requested type s (Kier and Hall 1976, 1985):
$${}^{n}\chi_{s} = \sum_{\text{subgraphs of type } s} {}^{n}P$$
Example of the molecular connectivity index
If we calculate molecular connectivity indices for methane and the fluorinated methanes, the following results are obtained in the study table. There is one row for each molecule as usual, and some columns for each type of subgraph. In the Topological Descriptors control panel, subgraph orders from 0 to 3 are specified as the default, so we see CHI0 through CHI3 columns. Had the range been 0 to 4, we would have seen a CHI4 column as well.
Let us consider the order zero indices first, in the first column (CHI0), which represent the simplest subdivision or subgraph: the set of vertices. The number of subgraphs of order zero is therefore equal to the number of skeletal atoms or vertices. Each vertex has a property $\delta$, which is the number of its electrons in sigma bonds to skeletal neighbors:
$$\delta_i = \sigma_i - h_i$$
Where:
$\sigma_i$ = number of electrons in sigma bonds to all neighbors.
$h_i$ = number of H atoms bonded to atom i.
The zeroth-order subgraph connectivity weight assigned to each vertex is:
$$c_i = \delta_i^{-1/2}$$
The order zero index is the sum of all vertex weights in the graph, that is, over all atoms in the skeleton.
Methane
Thus for methane, there is only one skeletal atom, C. It has four of its electrons in bonds and is bonded to four H atoms, and therefore has $\delta$ = 0 (that is, 4 − 4), and is assigned a $\chi$ index of 0.
Fluoromethane
Difluoromethane
The zeroth-order index holds little structural information. Only the presence of the nearest neighbor to each atom is captured. In the series methane through tetrafluoromethane, we see an increase in CHI0, which reflects the increasing size of the molecular skeleton.
Order one indices are based on the graph edges, that is, the bonds that connect the skeletal atoms. We replace the atom weight with the product of the $\delta$ values of the vertices or atoms that form the edge or bond. Thus, the weight of the edge between vertices i and j is:
$$c_{ij} = (\delta_i\,\delta_j)^{-1/2}$$
and as before, we sum all the weights to obtain the first-order index.
Methane
Leaving out hydrogens, the molecular graph of methane is a single point. It has no edges and therefore has a first-order index of 0.
Fluoromethane
Fluoromethane has one edge, representing the C-F bond.
Difluoromethane
Difluoromethane has two edges.
First-order indices contain more structural information than zeroth-order indices. The first-order index encodes the number of edges (bonds) in the molecular graph. Hence CHI-1 increases throughout the series methane through tetrafluoromethane. Beyond this, the immediate bonding environment of an atom is captured in the edge weights: the weight of the carbon atom becomes smaller as it becomes more substituted. This reduces the rate of increase of CHI-1 compared to CHI-0 over the same series.
In an acyclic compound containing A atoms, the number of skeletal bonds is:
$$P = A - 1$$
where P is called the number of paths of length 1. A "path of length 1" is a bond.
In a cyclic compound with R rings:
$$P = A - 1 + R$$
Thus the number of first-order weights encodes the number of rings.
For order two, we consider pairs of adjacent edges (bonds) in the molecular graph. Since methane has no bonds and fluoromethane has only one bond, CHI-2 for methane and fluoromethane is zero. Difluoromethane has one path of length 2, F-C-F.
This is computed in a manner analogous to the lower-order indices, as a product of reciprocal square roots:
$$c_{ijk} = (\delta_i\,\delta_j\,\delta_k)^{-1/2}$$
Thus for difluoromethane:
$$\text{CHI-2} = (1 \cdot 2 \cdot 1)^{-1/2} \approx 0.707$$
Trifluoromethane has three paths of length 2, one for each pair of C-F bonds.
None of the compounds have any paths of length 3, which would require three edges (that is, three bonds) to be connected, so the CHI-3_P values for the series are all zero. On the other hand, 1,2-difluoroethane, had we included it, would have a CHI-3_P index of 0.5.
However, there is another kind of third-order subgraph called a cluster, which involves four skeletal atoms in a trigonal relationship. In this example, this structural motif appears only in trifluoromethane and tetrafluoromethane.
The smallest possible ring is three-membered and, if there were any three-membered rings in our set, they would be captured by the CHI-3_CH index (CH for "chain", meaning "ring").
Similarly, none of our compounds have any paths of length 4, which would require four connected edges, hence all values for the CHI-4_P index are zero. Tetrafluoromethane contains a fourth-order cluster, however.
The higher-order indices are additive (because they are sums of weighting terms) and constitutive (because the size of the weights depends on atomic values), representing the entire molecular graph.
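The path-type values in this example can be reproduced with a short sketch. It is illustrative only and makes two assumptions: the simple valence $\delta$ of each atom equals its number of neighbors in the hydrogen-suppressed graph (true for the singly bonded molecules here), and a vertex with $\delta$ = 0, as in methane, contributes 0. The function name and the adjacency-list inputs are hypothetical; cluster and chain subgraphs are not handled.

```python
import math


def chi_path(adj, order):
    """Sum of (product of delta)^(-1/2) over all paths with `order` edges."""
    delta = {v: len(nbrs) for v, nbrs in adj.items()}
    total = 0.0

    def extend(path):
        nonlocal total
        if len(path) == order + 1:
            # Count each path once: keep only one of its two directions.
            if order == 0 or path[0] < path[-1]:
                product = math.prod(delta[v] for v in path)
                if product > 0:  # an isolated atom (delta = 0) contributes 0
                    total += product ** -0.5
            return
        for nbr in adj[path[-1]]:
            if nbr not in path:
                extend(path + [nbr])

    for v in adj:
        extend([v])
    return total


# Hydrogen-suppressed graphs; vertex 0 is carbon in the methane series.
methane = {0: []}
fluoromethane = {0: [1], 1: [0]}
difluoromethane = {0: [1, 2], 1: [0], 2: [0]}
difluoroethane_12 = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}  # F-C-C-F
```

Calling `chi_path(difluoroethane_12, 3)` gives the CHI-3_P value of 0.5 quoted in the text for 1,2-difluoroethane.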
This index is a refinement of the molecular connectivity index (see page 77 for definitions) where a vertex subgraph valence $\delta$ is enhanced to $\delta^{v}$ to take into account the electron configuration of the atom represented by the vertex:
$$\delta^{v} = \frac{Z^{v} - h}{Z - Z^{v} - 1}$$
where $Z^{v}$ is the number of valence electrons in the atom, Z is its atomic number, and h is the number of hydrogens bound to it. This formula is designed to reproduce the unmodified molecular connectivity index for saturated hydrocarbons, for which $\delta^{v} = \delta$. However, $\delta^{v}$ distinguishes between multiple and single bonds. The denominator introduces further distinction between element rows due to the presence of the atomic number Z (Kier and Hall 1976, 1985).
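A quick numerical check of the valence-modified vertex value is sketched below, assuming the standard Kier and Hall form delta_v = (Zv - h) / (Z - Zv - 1). The function name is hypothetical.

```python
def valence_delta(valence_electrons, atomic_number, n_hydrogens):
    """Valence-modified vertex value for one heavy atom."""
    return (valence_electrons - n_hydrogens) / (atomic_number - valence_electrons - 1)


delta_v_methyl_c = valence_delta(4, 6, 3)    # carbon in a CH3 group
delta_v_hydroxyl_o = valence_delta(6, 8, 1)  # oxygen in an OH group
delta_v_fluorine = valence_delta(7, 9, 0)    # fluorine substituent
delta_v_chlorine = valence_delta(7, 17, 0)   # chlorine substituent
```

Fluorine and chlorine have the same number of valence electrons, yet the atomic number in the denominator gives them different values, which is the row distinction mentioned in the text.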
This is the number of subgraphs of a given type and order (Kier and Hall 1976). (See Kier & Hall molecular connectivity index ($\chi$) for definitions.)
SC0
Refers to the number of zero-order subgraphs in the molecular graph. The number of subgraphs of order zero is simply the number of skeletal atoms or vertices in the molecular graph.
SC1
The number of first-order subgraphs in the molecular graph, which is the number of edges that connect the vertices of the molecular graph. In other words, it is the number of bonds in the molecule.
SC2
The number of second-order subgraphs in the molecular graph, which is the number of pairs of connected edges. In other words, it is the number of paths of length 2.
There are three types of third-order subgraph: Path, Cluster, and Ring.
SC3_P
The number of third-order subgraphs in the molecular graph: the number of paths of length 3.
SC3_C
Counts the number of clusters.
SC3_CH
Counts the number of rings or chains.
These indices compare the molecule graph with "minimal" and "maximal" graphs, where the meaning of "minimal" and "maximal" depends on the order n. This is intended to capture different aspects of the molecular shape.
Order 1:
The descriptor $\kappa_1$ encodes the count of atoms and the presence of cycles relative to the minimal and maximal graphs. For N vertices, the maximal graph includes edges between all vertex pairs. For the minimal graph, a linear path of N − 1 edges connecting the vertices is taken.
The shape index of order 1 is then defined as:
$$\kappa_1 = \frac{2\,P_{max}\,P_{min}}{P^{2}} \qquad \text{(Eq. 42)}$$
where P is the number of edges in the graph (edges are paths of length 1, hence the subscript on $\kappa_1$), $P_{max}$ is the number of edges in the maximal graph, namely N(N − 1)/2, and $P_{min}$ is the number of edges in the minimal graph, namely N − 1.
By inserting the formulas for $P_{max}$ and $P_{min}$, one obtains the implemented formula:
$$\kappa_1 = \frac{N(N-1)^{2}}{P^{2}} \qquad \text{(Eq. 43)}$$
Order 2:
The descriptor $\kappa_2$ encodes the branching. P, $P_{min}$, and $P_{max}$ now denote the number of paths of length 2 in the corresponding graphs. The maximal graph is taken to be the star graph in which all atoms are adjacent to a common atom. Thus, $P_{max}$ = (N − 1)(N − 2)/2. The linear graph is again taken as the minimal graph, so $P_{min}$ = N − 2. Eq. 42 above thus yields:
$$\kappa_2 = \frac{(N-1)(N-2)^{2}}{P^{2}} \qquad \text{(Eq. 44)}$$
Order 3:
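For order 3, the counts of paths of length 3 are considered, and the maximal graph chosen is a twin-star (Kier 1990) with $P_{max}$ = (N − 1)(N − 3)/4 for N odd and $P_{max}$ = (N − 2)^{2}/4 for N even. The minimal graph is again the linear one with $P_{min}$ = N − 3.
Eq. 42 is adjusted by another factor of 2, in the words of the index designer, "to bring the values into rough equivalence with the other kappa values" (Kier 1990, Hall and Kier 1991):
$$\kappa_3 = \frac{4\,P_{max}\,P_{min}}{P^{2}} = \begin{cases} \dfrac{(N-1)(N-3)^{2}}{P^{2}} & N \text{ odd} \\[2ex] \dfrac{(N-3)(N-2)^{2}}{P^{2}} & N \text{ even} \end{cases} \qquad \text{(Eq. 45)}$$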
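The three shape indices can be sketched from the atom count and the path counts alone. The closed forms below follow from inserting the stated Pmax and Pmin values into the defining ratio; returning 0.0 when a molecule has no paths of the required length is an assumption made here to avoid division by zero, and the function names are hypothetical.

```python
def kappa1(n_atoms, n_paths1):
    """Order-1 shape index from the number of edges (paths of length 1)."""
    if n_paths1 == 0:
        return 0.0  # assumption: undefined case reported as 0
    return n_atoms * (n_atoms - 1) ** 2 / n_paths1 ** 2


def kappa2(n_atoms, n_paths2):
    """Order-2 shape index from the number of paths of length 2."""
    if n_paths2 == 0:
        return 0.0
    return (n_atoms - 1) * (n_atoms - 2) ** 2 / n_paths2 ** 2


def kappa3(n_atoms, n_paths3):
    """Order-3 shape index; the twin-star maximum differs for odd and even N."""
    if n_paths3 == 0:
        return 0.0
    if n_atoms % 2 == 1:
        return (n_atoms - 1) * (n_atoms - 3) ** 2 / n_paths3 ** 2
    return (n_atoms - 3) * (n_atoms - 2) ** 2 / n_paths3 ** 2


# n-Butane skeleton: N = 4 with 3, 2, and 1 paths of length 1, 2, and 3.
butane_kappas = (kappa1(4, 3), kappa2(4, 2), kappa3(4, 1))
# Isobutane skeleton: N = 4 with 3, 3, and 0 paths of length 1, 2, and 3.
isobutane_kappas = (kappa1(4, 3), kappa2(4, 3), kappa3(4, 0))
```

The two isomers share the same first-order value, and the branching only shows up at order 2, consistent with the roles assigned to each order above.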
Kier's alpha-modified shape indices ($\kappa^{\alpha}_n$, n = 1, 2, 3)
These indices are refinements of the shape index (see previous section) that take into consideration the contribution that covalent radii and hybridization states make to the shape of the molecule. The indices $\kappa^{\alpha}_n$ are defined by Eq. 43-45, with the atom count N replaced by the modified atom count N + $\alpha$. The modifier $\alpha$ is defined as:
$$\alpha = \sum_{i} \left( \frac{r_i}{r_{Csp^3}} - 1 \right)$$
where the summation is over all heavy atoms of the molecule. Here, $r_i$ is the radius of the i^{th} heavy atom and $r_{Csp^3}$ is the radius of the sp^{3} carbon (taken to be 0.77 Å in this implementation). In this calculation the following atoms are considered to be heavy: C, N, O, F, P, Cl, Br, and I (Kier 1990, Hall and Kier 1991).
This is a descriptor based on structural properties that restrict a molecule from being "infinitely flexible", the model for which is an endless chain of C(sp^{3}) atoms. The structural features considered as preventing a molecule from attaining infinite flexibility are: (a) fewer atoms, (b) the presence of rings, (c) branching, and (d) the presence of atoms with covalent radii smaller than those of C(sp^{3}). These features are encoded in the index as follows:
$$\Phi = \frac{\kappa^{\alpha}_1\,\kappa^{\alpha}_2}{N}$$
where N = number of vertices (Hall and Kier 1991).
This is a highly discriminating descriptor, whose values do not substantially increase with molecule size and the number of rings present (Balaban 1982, Balaban and Ivanciuc 1989). Its evaluation begins with the D-matrix modified as follows:
Having constructed the modified D-matrix, the row sums are calculated:
$$s_i = \sum_{j=1}^{N} d_{ij}$$
where $d_{ij}$ are the entries of the modified D-matrix, N is the number of vertices, and i = 1, ..., N.
At this stage the contributions based on heteroatom electronegativities and heteroatom covalent radii are included by modifying the $s_i$ values. The modifiers are two-parameter approximations of electronegativities and covalent radii relative to those of carbon. The exact formulas used in the index calculations are:
where $Z_i$ is the atomic number and $G_i$ is the (short) periodic table group number. These modifiers are used only with nonmetals: B, C, N, O, F, Si, P, S, Cl, As, Se, Br, Te, and I. For other heteroatoms the values are set at X = Y = 1.
Given the values of X and/or Y for each vertex, the numbers si are adjusted as follows:
sai = X si (for the index JX)
sai = Y si (for the index JY)
and the result inserted in the final formula for the index:
$$J = \frac{M}{M - N + 2} \sum_{(i,j)} \left( s^{a}_i\, s^{a}_j \right)^{-1/2}$$
where J equals either JX or JY, depending on the modifier type used, M is the number of edges, N is the number of vertices, and the sum is over all pairs (i, j) with adjacent vertices $v_i$ and $v_j$.
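The overall shape of the calculation can be sketched for the simplest case. The code below is an illustration under stated assumptions, not the JX or JY implementation: it uses the plain (unmodified) distance matrix, sets all heteroatom modifiers to X = Y = 1, and assumes the standard Balaban prefactor M / (M - N + 2). Function names are hypothetical.

```python
from collections import deque


def distance_sums(adj):
    """Row sums of the topological distance matrix, one per vertex."""
    sums = {}
    for start in adj:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            v = queue.popleft()
            for nbr in adj[v]:
                if nbr not in dist:
                    dist[nbr] = dist[v] + 1
                    queue.append(nbr)
        sums[start] = sum(dist.values())
    return sums


def balaban_j(adj):
    """Unmodified Balaban index for a connected hydrogen-suppressed graph."""
    n_vertices = len(adj)
    n_edges = sum(len(nbrs) for nbrs in adj.values()) // 2
    s = distance_sums(adj)
    # Sum over adjacent vertex pairs, each edge counted once (u < v).
    edge_sum = sum((s[u] * s[v]) ** -0.5
                   for u in adj for v in adj[u] if u < v)
    return n_edges / (n_edges - n_vertices + 2) * edge_sum


n_butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
cyclohexane = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
```

To obtain JX or JY, each row sum would be multiplied by the corresponding X or Y modifier before the edge sum is formed.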
In this approach, molecules are viewed as structures that can be partitioned into subsets of elements that are in some sense equivalent. The notion of equivalence depends on the particular descriptor. Consider a partition of a set of N elements into k subsets, with the i^{th} subset consisting of N_{i} elements:
equivalence class: 1 2 ... k
number of elements in each: N_{1} N_{2} ... Nk
N_{1} + N_{2} + ... + Nk = N
Given a partition P as above, we use the notation:
P = N (N_{1}, N_{2}, ... , Nk).
A probability distribution can be associated with the partition:
pi = Ni / N,
the probability for a randomly chosen element to belong to class i. This degree of uncertainty can also be expressed by the entropy:
$H_i = -\,\mathrm{lb}\, p_i$ (lb is the base-2 logarithm).
The mean entropy of such a probability distribution is then:
$$H = -\sum_{i=1}^{k} p_i\, \mathrm{lb}\, p_i$$
which, according to Shannon's statistical information theory (Bonchev 1983, and references therein), can be viewed as a measure of the mean quantity of information contained in each structure element (in bits per element).
The partition P, the probabilities pi, and the mean quantity of information H form the pattern of calculation for all the information-theoretic descriptors.
The atoms in the molecule are partitioned into equivalence classes corresponding to their atomic numbers. The partition then yields the descriptor IACMean as the mean quantity of information H as defined above.
The descriptor IACTotal is defined as N × IACMean, where N is the number of atoms in the molecule.
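The calculation pattern (partition, probabilities, mean information) can be sketched as follows, with the atomic-composition partition as the example. This is an illustration, not the QSAR+ code; the function names are hypothetical, and counting hydrogens among the atoms is an assumption made here.

```python
import math
from collections import Counter


def mean_information(class_sizes):
    """Mean entropy H = -sum(p_i * lb p_i) in bits per element."""
    n = sum(class_sizes)
    return -sum((k / n) * math.log2(k / n) for k in class_sizes if k > 0) + 0.0


def iac_mean(atomic_numbers):
    """Mean information of the partition of atoms by atomic number."""
    return mean_information(list(Counter(atomic_numbers).values()))


def iac_total(atomic_numbers):
    """Number of atoms times the mean information."""
    return len(atomic_numbers) * iac_mean(atomic_numbers)


# Ethanol, C2H6O, listed by atomic number (hydrogens included by assumption).
ethanol = [6, 6, 8] + [1] * 6
```

A molecule made of a single element gives 0 bits, and a partition into equally sized classes gives the maximum for that number of classes.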
The two information indices in this category are:
These indices (and several others below) are based on partitioning elements of the A-matrix according to two basic modes:
The example below should make this clear.
Vertex adjacency/equality
The A-matrix consists of zeros and ones, so the partitioning consists of two classes:
P = N^{2} (2M, N^{2} − 2M)
with M equal to the number of edges (thus 2M equals the number of ones in the A-matrix) and N equal to the number of vertices (N^{2} − 2M is the number of zeros in the A-matrix).
Therefore:
$$H = -\frac{2M}{N^{2}}\,\mathrm{lb}\,\frac{2M}{N^{2}} - \frac{N^{2}-2M}{N^{2}}\,\mathrm{lb}\,\frac{N^{2}-2M}{N^{2}}$$
Vertex adjacency/magnitude
Each matrix element aij is now treated as an equivalence class of aij elements. In this case, each equivalence class consists of either one or zero elements, so the partition is (discarding the classes of zero elements):
P = 2M (1, 1, ... , 1) (2M ones)
The index V_ADJ_mag is thus rather simple:
$$\text{V\_ADJ\_mag} = -\sum_{i=1}^{2M} \frac{1}{2M}\,\mathrm{lb}\,\frac{1}{2M} = \mathrm{lb}(2M)$$
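Both adjacency-based partitions can be evaluated from the vertex and edge counts alone. The sketch below applies the mean-information formula to the two partitions described above; it is illustrative, the function names are hypothetical, and the results are the values implied by those partitions rather than output checked against QSAR+.

```python
import math


def adjacency_equality_information(n_vertices, n_edges):
    """Mean information of the partition N^2 (2M, N^2 - 2M)."""
    n_entries = n_vertices ** 2
    classes = [2 * n_edges, n_entries - 2 * n_edges]
    return -sum((k / n_entries) * math.log2(k / n_entries)
                for k in classes if k > 0) + 0.0


def adjacency_magnitude_information(n_edges):
    """Mean information of the partition 2M (1, 1, ..., 1), equal to lb(2M)."""
    return math.log2(2 * n_edges)


# n-Butane skeleton: N = 4 vertices, M = 3 edges.
equ_butane = adjacency_equality_information(4, 3)
mag_butane = adjacency_magnitude_information(3)
```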
Information indices based on the D-matrix
Two types of indices are based on this matrix:
These descriptors are defined in exactly the same manner as the vertex adjacency indices, except that the distance matrix is used instead of the adjacency matrix.
The indices based on these matrices are:
These are the descriptors based on the edge adjacency and the edge distance matrices, in exact analogy with those given in the section Information indices based on the A-matrix.
To each vertex v, an unordered sequence of ordered pairs is assigned:
{ (m1, n1), (m2, n2), ... , (mk, nk) }, called a coordinate, such that:
k = the valence of the vertex (there is one ordered pair (m_{j}, n_{j}) per each neighboring vertex, v_{j}), and for every j = 1, ..., k:
Having assigned the coordinates to vertices, the partition of vertices is constructed in the usual way, where two vertices are considered equivalent if their coordinates are the same (as unordered k-tuples, i.e., the repetitions of ordered pairs are not ignored, as they would be if we treated the k-tuples purely as sets).
The index corresponding directly to this partition is the index IC ("Information Content").
The following indices are normalizations of IC:
The CIC ("Complementary Information Content") measures the deviation of IC from its maximum possible value corresponding to the partition into classes containing one element each:
ICmax = −N × (1/N) × lb(1/N) = lb(N)
and thus the CIC index is defined as:
$$CIC = \mathrm{lb}(N) - IC$$
(Sarkar et al. 1978, Bonchev et al. 1981, Bonchev 1983, Katritzky et al. 1993.)
Terms
Path: an alternating sequence of vertices v and edges e beginning and ending with vertices in which each edge is adjacent to the two vertices immediately preceding and following it.
If there is a path of the form: v0, e0, v1, e1, ..., v_{n-1}, e_{n-1}, vn we say the vertices v0 and vn are connected by this path.
Path length: the number of edges in the path.
Minimal path: a path of minimal length among all paths connecting a given vertex pair.
Cycle: a path v0, e0, ..., vn with v0 = vn.
Connected graph: a graph is connected if, for any pair of vertices, there is a path connecting them.
Subgraph: a subset of the set of vertices and edges of the original graph which is itself a valid graph (namely, with each edge it contains the vertices adjacent to it).
Vertex valence: number of edges adjacent to a vertex. Note that the i^{th} vertex valence equals the sum of matrix elements in the i^{th} row (or column) of the A-matrix. It is denoted by $\delta_i$.
Adjacency matrix (A-matrix): a symmetric N x N matrix {aij} defined as:
N = number of vertices, aij = 1 if the vertices vi and vj are connected by an edge, aij = 0 otherwise.
Edge adjacency matrix (E-matrix): a symmetric M x M matrix {ekl} that is in a sense "complementary" to the A-matrix and is defined as:
M = number of edges, ekl = 1 if edges ek and el share exactly one common vertex, ekl = 0 otherwise.
Distance matrix (D-matrix): a symmetric N x N matrix {a^{D}_{ij}} (N = number of vertices) where a^{D}_{ij} is the number of edges in a path of minimal length connecting vi to vj.
Edge distance matrix (ED-matrix): a symmetric M x M matrix {e^{D}_{kl}} (M = number of edges) where e^{D}_{kl} is the number of vertices in a path of minimal length connecting ek to el (not counting the terminal vertices of the path).

This table lists the MSA descriptors available in QSAR+:
The common volume between each individual molecule and the molecule selected as the reference compound. This is a measure of how similar in steric shape the analogs are to the shape reference.
The difference between the volume of the individual molecule and the volume of the shape reference compound.
The common overlap steric volume descriptor divided by the volume of the individual molecule.
The difference between the volume of the individual molecule and the common overlap steric volume.
Root mean square (rms) deviation between the individual molecule and the shape reference compound.
The volume of the shape reference compound.
This table lists the spatial descriptors available in QSAR+:
This set of geometric descriptors helps to characterize the shape of the molecules. The descriptors are calculated by projecting the molecular surface on three mutually perpendicular planes, XY, YZ, and XZ (Rohrbaugh and Jurs 1987). These descriptors depend not only on conformation but also on the orientation of the molecule. To calculate them, the molecules are first rotated to align the principal moments of inertia with the X, Y, and Z axes.
A total of 10 descriptors are calculated in this set:
1. Area of the molecular shadow in the XY plane (Sxy).
2. Area of the molecular shadow in the YZ plane (Syz).
3. Area of the molecular shadow in the XZ plane (Sxz).
4. Fraction of area of molecular shadow in the XY plane over area of enclosing rectangle (Sxy,f).
5. Fraction of area of molecular shadow in the YZ plane over area of enclosing rectangle (Syz,f).
6. Fraction of area of molecular shadow in the XZ plane over area of enclosing rectangle (Sxz,f).
7. Length of molecule in the X dimension (Lx).
8. Length of molecule in the Y dimension (Ly).
9. Length of molecule in the Z dimension (Lz).
10. Ratio of largest to smallest dimension.
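The ten shadow descriptors listed above can be estimated with a simple grid projection. This Python sketch is an illustration under stated assumptions, not the QSAR+ algorithm: atoms are (x, y, z, radius) spheres, the coordinates are assumed to be already aligned with the principal axes, and the dictionary keys are hypothetical names chosen to mirror the list.

```python
def shadow_indices(atoms, step=0.05):
    """Grid estimate of shadow areas, area fractions, and lengths."""
    # Molecular extent along X, Y, Z, including the atomic radii.
    lo = [min(a[i] - a[3] for a in atoms) for i in range(3)]
    hi = [max(a[i] + a[3] for a in atoms) for i in range(3)]
    length = [hi[i] - lo[i] for i in range(3)]
    out = {"Lx": length[0], "Ly": length[1], "Lz": length[2]}
    for name, (i, j) in {"xy": (0, 1), "yz": (1, 2), "xz": (0, 2)}.items():
        ni = max(1, round(length[i] / step))
        nj = max(1, round(length[j] / step))
        di, dj = length[i] / ni, length[j] / nj
        covered = 0
        for p in range(ni):
            u = lo[i] + (p + 0.5) * di
            for q in range(nj):
                v = lo[j] + (q + 0.5) * dj
                # A grid cell is in the shadow if any projected atom covers it.
                if any((u - a[i]) ** 2 + (v - a[j]) ** 2 <= a[3] ** 2
                       for a in atoms):
                    covered += 1
        area = covered * di * dj
        out["S" + name] = area
        out["S" + name + "_f"] = area / (length[i] * length[j])
    out["ratio"] = max(length) / min(length)
    return out

# Two touching unit spheres placed along the X axis.
indices = shadow_indices([(-1.0, 0.0, 0.0, 1.0), (1.0, 0.0, 0.0, 1.0)])
```

In this example the YZ shadow is a single disc because both spheres project onto the same circle, while the XY and XZ shadows are two touching discs.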
Jurs descriptors based on partial charges mapped on surface area
This set of descriptors (Stanton and Jurs 1990) combines shape and electronic information to characterize molecules. The descriptors are calculated by mapping atomic partial charges on solvent-accessible surface areas of individual atoms. A total of 30 different descriptors are included in the set:
1. Partial positive surface area: sum of the solvent-accessible surface areas of all positively charged atoms (PPSA1).
2. Partial negative surface area: sum of the solvent-accessible surface areas of all negatively charged atoms (PNSA1).
3. Total charge weighted positive surface area: partial positive solvent-accessible surface area multiplied by the total positive charge (PPSA2).
4. Total charge weighted negative surface area: partial negative solvent-accessible surface area multiplied by the total negative charge (PNSA2).
5. Atomic charge weighted positive surface area: sum of the product of solvent-accessible surface area × partial charge for all positively charged atoms (PPSA3).
6. Atomic charge weighted negative surface area: sum of the product of solvent-accessible surface area × partial charge for all negatively charged atoms (PNSA3).
7. Difference in charged partial surface areas: partial positive solvent-accessible surface area minus partial negative solvent-accessible surface area (DPSA1).
8. Difference in total charge weighted surface areas: total charge weighted positive solvent-accessible surface area minus total charge weighted negative solvent-accessible surface area (DPSA2).
9. Difference in atomic charge weighted surface areas: atomic charge weighted positive solvent-accessible surface area minus atomic charge weighted negative solvent-accessible surface area (DPSA3).
10-15. Fractional charged partial surface areas: set of six descriptors obtained by dividing descriptors 1 to 6 by the total molecular solvent-accessible surface area (FPSA1, FPSA2, FPSA3, FNSA1, FNSA2, FNSA3).
16-21. Surface-weighted charged partial surface areas: set of six descriptors obtained by multiplying descriptors 1 to 6 by the total molecular solvent-accessible surface area and dividing by 1000 (WPSA1, WPSA2, WPSA3, WNSA1, WNSA2, WNSA3).
22. Relative positive charge: charge of most positive atom divided by the total positive charge (RPCG).
23. Relative negative charge: charge of most negative atom divided by the total negative charge (RNCG).
24. Relative positive charge surface area: solvent-accessible surface area of the most positive atom divided by descriptor 22 (RPCS).
25. Relative negative charge surface area: solvent-accessible surface area of the most negative atom divided by descriptor 23 (RNCS).
26. Total hydrophobic surface area: sum of solvent-accessible surface areas of atoms with absolute value of partial charges less than 0.2 (TASA).
27. Total polar surface area: sum of solvent-accessible surface areas of atoms with absolute value of partial charges greater than or equal to 0.2 (TPSA).
28. Relative hydrophobic surface area: total hydrophobic surface area divided by the total molecular solvent-accessible surface area (RASA).
29. Relative polar surface area: total polar surface area divided by the total molecular solvent-accessible surface area (RPSA).
30. Total molecular solvent-accessible surface area (SASA).
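Because every descriptor in this list is a simple combination of per-atom partial charges and solvent-accessible surface areas, the whole set can be sketched in a few lines of Python. The function below follows the definitions exactly as worded in the list; it is not the QSAR+ code, it assumes the per-atom charges and surface areas have already been computed elsewhere, it assumes the F and W descriptors are matched to descriptors 1 to 6 by name (FPSA1 from PPSA1, FNSA1 from PNSA1, and so on), and it requires at least one positively and one negatively charged atom.

```python
def cpsa_descriptors(atoms, polar_cutoff=0.2):
    """atoms: list of (partial_charge, solvent_accessible_area) per atom."""
    pos = [(q, s) for q, s in atoms if q > 0]
    neg = [(q, s) for q, s in atoms if q < 0]
    sasa = sum(s for _, s in atoms)
    q_pos = sum(q for q, _ in pos)   # total positive charge
    q_neg = sum(q for q, _ in neg)   # total negative charge (a negative number)

    d = {"SASA": sasa}
    d["PPSA1"] = sum(s for _, s in pos)
    d["PNSA1"] = sum(s for _, s in neg)
    d["PPSA2"] = d["PPSA1"] * q_pos
    d["PNSA2"] = d["PNSA1"] * q_neg
    d["PPSA3"] = sum(q * s for q, s in pos)
    d["PNSA3"] = sum(q * s for q, s in neg)
    for k in (1, 2, 3):
        d[f"DPSA{k}"] = d[f"PPSA{k}"] - d[f"PNSA{k}"]
        for sign in ("P", "N"):
            base = d[f"P{sign}SA{k}"]
            d[f"F{sign}SA{k}"] = base / sasa            # fractional
            d[f"W{sign}SA{k}"] = base * sasa / 1000.0   # surface-weighted

    q_max, s_max = max(pos)   # most positive atom and its surface area
    q_min, s_min = min(neg)   # most negative atom and its surface area
    d["RPCG"] = q_max / q_pos
    d["RNCG"] = q_min / q_neg
    d["RPCS"] = s_max / d["RPCG"]
    d["RNCS"] = s_min / d["RNCG"]

    d["TASA"] = sum(s for q, s in atoms if abs(q) < polar_cutoff)
    d["TPSA"] = sum(s for q, s in atoms if abs(q) >= polar_cutoff)
    d["RASA"] = d["TASA"] / sasa
    d["RPSA"] = d["TPSA"] / sasa
    return d

# Made-up charges (e) and surface areas (Å^2), for illustration only.
example = cpsa_descriptors([(0.3, 10.0), (-0.5, 20.0), (0.1, 5.0), (0.0, 15.0)])
```

Atoms with a partial charge of exactly zero contribute to SASA and to the hydrophobic area but to neither the positive nor the negative sums.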
Molecular surface area (Area)
The molecular surface area descriptor is a 3D spatial descriptor that describes the van der Waals area of a molecule. The molecular surface area determines the extent to which a molecule exposes itself to the external environment. This descriptor is related to binding, transport, and solubility.
The radius of gyration is calculated using the following equation:
R_{g} = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} ( x_{i}^{2} + y_{i}^{2} + z_{i}^{2} ) }
where N is the number of atoms and x_{i}, y_{i}, z_{i} are the coordinates of atom i relative to the center of mass.
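A minimal Python sketch of this calculation follows. It assumes the usual reading of the definition: the center of mass is mass-weighted, while the root-mean-square distance from it is averaged over the N atoms without mass weighting. The function name is illustrative, not a QSAR+ identifier.

```python
import math

def radius_of_gyration(coords, masses):
    """RMS distance of the atoms from the molecular center of mass."""
    total = sum(masses)
    com = [sum(m * c[i] for m, c in zip(masses, coords)) / total
           for i in range(3)]
    n = len(coords)
    squared = sum((c[0] - com[0]) ** 2 + (c[1] - com[1]) ** 2
                  + (c[2] - com[2]) ** 2 for c in coords)
    return math.sqrt(squared / n)

# Two equal-mass atoms 2 Å apart lie 1 Å from their center of mass.
rg = radius_of_gyration([(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)], [12.0, 12.0])
```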
A 3D spatial descriptor that is defined as the ratio of molecular weight to molecular volume. It has the units of g ml^{-1}. The density reflects the types of atoms and how tightly they are packed in a molecule. Density can be related to transport and melt behavior.
Calculates the principal moments of inertia about the principal axes of a molecule according to the following rules:
For more information about this descriptor, see Hill (1960).
A 3D spatial descriptor that defines the molecular volume inside the contact surface. The molecular volume is calculated as a function of conformation. Molecular volume is related to binding and transport.
This table lists the structural descriptors available in QSAR+:
Counts the number of bonds in the current molecule having rotations that are considered to be meaningful for molecular mechanics. All terminal H atoms are ignored (for example, methyl groups are not considered rotatable).
This table lists the thermodynamic descriptors available in QSAR+:
LogP (the octanol/water partition coefficient) and molar refractivity are molecular descriptors that can be used to relate chemical structure to observed chemical behavior. LogP is related to the hydrophobic character of the molecule. The molar refractivity of a substituent is a combined measure of its size and polarizability.
The QSAR+ descriptors ALogP and molar refractivity are calculated using the method described by Ghose and Crippen (1989). In this atom-based approach, each atom of the molecule is assigned to a particular class, and each class makes an additive contribution to the total value of logP and molar refractivity.
For more information, see Leffler and Grunwald (1963).
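The additive, atom-class scheme described above amounts to a table lookup and a sum. The sketch below shows only that structure: the class labels and contribution values are hypothetical placeholders invented for this illustration and are not the published Ghose and Crippen parameters, and the assignment of atoms to classes (the hard part of the method) is assumed to have been done already.

```python
# Hypothetical placeholder contributions; NOT the published parameter set.
LOGP_CONTRIB = {"C_aliphatic": 0.25, "H_on_C": 0.125, "O_hydroxyl": -0.5, "H_on_O": -0.25}
MR_CONTRIB = {"C_aliphatic": 2.5, "H_on_C": 1.0, "O_hydroxyl": 1.5, "H_on_O": 1.0}

def additive_property(atom_classes, contributions):
    """Sum the per-class contribution of every atom in the molecule."""
    return sum(contributions[cls] for cls in atom_classes)

# Methanol written as a list of (already assigned) atom classes.
methanol = ["C_aliphatic", "H_on_C", "H_on_C", "H_on_C", "O_hydroxyl", "H_on_O"]
logp_estimate = additive_property(methanol, LOGP_CONTRIB)
mr_estimate = additive_property(methanol, MR_CONTRIB)
```

The same function serves both properties because only the contribution table changes between logP and molar refractivity.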
AlogP98 descriptor
The AlogP98 descriptor is an implementation of the atom-type-based AlogP method using the latest published set of parameters (Ghose et al. 1998).
F_{oct} and F_{H2O} are physicochemical properties associated with linear free energy (LFE) models of a molecule. These properties have proven useful as molecular descriptors in structure-activity analyses. All LFE computations are based solely on the connectivity of the atoms in a molecule. LFE computations are not conformationally dependent.
F_{oct} is the 1-octanol desolvation free energy and F_{H2O} is the aqueous desolvation free energy derived from a hydration shell model developed by Hopfinger. Both F_{oct} and F_{H2O} are in kcal mol^{-1}.
QSAR+ calculates F_{H2O} and F_{oct} for each molecule by searching the molecule for recognizable substituent groups and their bonding patterns, then summing the substituent constant contributions of each group that is present in the molecule.
For more information, see Hopfinger (1973; 1980) and Pearlman (1980).
The enthalpy for forming a molecule from its constituent atoms, a measure of the relative thermal stability of a molecule. This descriptor is calculated using the MNDO semiempirical molecular orbital method of Dewar. MNDO is the most rigorous quantum-chemical technique available in QSAR+ and has a wide range of applicability in conformational analysis, intermolecular modeling, and chemical reaction modeling. The limit for MNDO is 300 atoms or 300 atomic orbitals (whichever is less) per molecule. The atoms treated by MNDO are: H, B, C, N, O, F, Al, Si, P, S, and Cl.
For more information, see Dewar and Thiele (1977a; 1977b).
pK_{a} values are calculated and the results displayed in the study table according to user-defined rules.
The pKa program, available separately from Advanced Chemistry Development (ACD), is needed for use of this descriptor. You can contact ACD through their website at www.acdlabs.com.
Molecular field analysis (MFA) evaluates the energy between a probe and a molecular model at a series of points defined by a rectangular or spherical grid. These energies may be added to the study table to form new columns headed according to the probe type. The new columns may be used as independent X variables in the generation of QSARs. For more information about working with MFA descriptors, see Chapter 10, Performing molecular field analysis.
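To make the idea of probe energies on a rectangular grid concrete, here is a small Python sketch. It is not the MFA implementation: it assumes a purely electrostatic point-charge probe, atoms given as (x, y, z, partial charge), the standard Coulomb conversion factor of 332.0637 kcal Å mol^{-1} e^{-2}, and a hypothetical `min_dist` cap that avoids singular energies when a grid point falls on an atom. Each returned energy would correspond to one new study table column for that molecule.

```python
import math

COULOMB_KCAL = 332.0637  # kcal Å / (mol e^2)

def rectangular_grid(lo, hi, spacing):
    """Regularly spaced points filling the box from lo to hi (inclusive)."""
    counts = [int(round((hi[i] - lo[i]) / spacing)) + 1 for i in range(3)]
    return [(lo[0] + i * spacing, lo[1] + j * spacing, lo[2] + k * spacing)
            for i in range(counts[0])
            for j in range(counts[1])
            for k in range(counts[2])]

def probe_energies(atoms, grid, probe_charge=1.0, min_dist=0.5):
    """Electrostatic energy of a point-charge probe at every grid point."""
    energies = []
    for gx, gy, gz in grid:
        total = 0.0
        for x, y, z, q in atoms:
            r = math.sqrt((gx - x) ** 2 + (gy - y) ** 2 + (gz - z) ** 2)
            total += COULOMB_KCAL * probe_charge * q / max(r, min_dist)
        energies.append(total)
    return energies

# One unit positive charge at the origin, probed on a 2 Å box around it.
grid = rectangular_grid((-2.0, -2.0, -2.0), (2.0, 2.0, 2.0), 2.0)
columns = probe_energies([(0.0, 0.0, 0.0, 1.0)], grid)
```

A real field calculation would normally add a steric (for example Lennard-Jones) term for uncharged probes; that is left out here to keep the sketch short.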
For a theoretical description of receptor surface models, please see Cerius^{2} Hypothesis and Receptor Models, which touches briefly on functionality in the Receptor module called receptor surface analysis (RSA).
If you have used Receptor, you may already be familiar with the idea of using the energy of interaction between a drug model and a receptor surface model to calculate a QSAR. For an example of this, run the demonstration log file Cerius2Resources/ EXAMPLES/demos/DDW_receptordemo2.log. The energies of interaction between the receptor surface model and each molecular model are added to the study table as new columns, which you can use for generating QSARs. These energies may be added to the study table with the Receptor_energies descriptor.
An additional descriptor, Receptor_RSA, allows you to add the energy of interaction between each point on the receptor surface and each model to the study table and use these surface point energies to calculate a QSAR. Instead of one total number that is the sum of the interactions evaluated between each point on the surface and each molecular model, leading to one extra column in the study table, you now have available the energies at each surface point.
Depending on the size of the drug molecules, this is potentially a great number of surface points. Filtering methods are available to reduce the input to the study table, based on the variance of the energies at any point, correlation of the energies with activity data, or simply adding every n^{th} point.
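The three filtering ideas mentioned above (variance, correlation with activity, and every n-th point) can be sketched as follows. This is an illustrative Python outline, not the Receptor_RSA code; the function and parameter names are hypothetical, and `columns[p][m]` is assumed to hold the energy at surface point p for molecule m.

```python
import math

def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either series is constant."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def filter_points(columns, activities, every_nth=None,
                  min_variance=None, min_abs_corr=None):
    """Return the indices of the surface points that pass every filter given."""
    keep = list(range(len(columns)))
    if every_nth is not None:
        keep = keep[::every_nth]
    if min_variance is not None:
        keep = [p for p in keep if variance(columns[p]) >= min_variance]
    if min_abs_corr is not None:
        keep = [p for p in keep
                if abs(pearson(columns[p], activities)) >= min_abs_corr]
    return keep

# Four surface points evaluated for three molecules (made-up numbers).
columns = [[1.0, 1.0, 1.0], [1.0, 2.0, 3.0], [3.0, 2.0, 1.0], [1.0, 5.0, 1.0]]
activities = [1.0, 2.0, 3.0]
kept = filter_points(columns, activities, min_variance=0.1, min_abs_corr=0.9)
```

Here the constant column is removed by the variance filter and the column that is uncorrelated with activity is removed by the correlation filter.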
The technique resembles CoMFA but, instead of a rectangular grid, the points considered are taken from the receptor surface. Therefore they are probably more chemically relevant than a rectangular grid, because they exist on a surface that is shaped like a molecule, and even better, a surface constructed from a subset of active molecules.
After adding the receptor surface point energies to the study table, you may calculate a QSAR using the receptor surface energies and biological activities. Early tests indicate that if the genetic function algorithm (GFA) method is used, nonlinear terms must be included.
Intestinal Absorption Model
Approximately 200 well-absorbed molecules, of which 181 were drugs or druglike, were used to develop a pattern recognition model for predicting passive intestinal absorption. The model was developed using robust outlier detection methods to identify and remove actively transported molecules. The descriptor space was chosen to be AlogP98 and van der Waals polar surface area (PSA) for nitrogen and oxygen atoms and their attached hydrogen atoms. The multivariate distance (T^{2}) from the center of the PSA-AlogP98 space was computed, and cutoffs were used to classify test set molecules. Fast PSA (FPSA) was used to rapidly and accurately approximate static PSA (R^{2} = 0.996, RMSE = 5.5 Å^{2}). Results demonstrated that 91.5% of highly Caco-2-permeable molecules are classified as well absorbed or moderately absorbed.
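A multivariate distance from the center of a two-descriptor space is commonly computed as a squared Mahalanobis distance, and the sketch below takes that reading of T^{2}. It is not the shipped model: the center, covariance matrix, and cutoff values are hypothetical inputs, and `classify` simply returns the index of the first cutoff that the distance does not exceed.

```python
def t_squared(point, center, cov):
    """Squared Mahalanobis distance of a 2D point from the center."""
    dx, dy = point[0] - center[0], point[1] - center[1]
    (a, b), (c, d) = cov
    det = a * d - b * c
    # Inverse of the 2 x 2 covariance matrix.
    inv = [[d / det, -b / det], [-c / det, a / det]]
    return (dx * (inv[0][0] * dx + inv[0][1] * dy)
            + dy * (inv[1][0] * dx + inv[1][1] * dy))

def classify(t2, cutoffs):
    """Index of the first cutoff not exceeded; len(cutoffs) if beyond all."""
    for level, cutoff in enumerate(cutoffs):
        if t2 <= cutoff:
            return level
    return len(cutoffs)

# Hypothetical (PSA, AlogP98) center and covariance, for illustration only.
center = (60.0, 2.0)
cov = [[900.0, 0.0], [0.0, 4.0]]
t2 = t_squared((90.0, 4.0), center, cov)   # one standard deviation on each axis
level = classify(t2, cutoffs=[5.0, 10.0])
```

With nested cutoffs, each level corresponds to a larger ellipse around the center of the descriptor space.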
BBB Penetration Model Features
Two models were developed:
1. A robust regression (least-median-of-squares) to predict logBB values, based on over 120 compounds.
2. A BBB confidence ellipse (derived from over 800 compounds classified as CNS therapeutic) after robust outlier detection.
The regression model has R^{2} = 0.889 and RMSE = 0.31. Molecules outside the BBB confidence ellipse are predicted to have extremely poor logBB, so no numerical prediction is given via regression. The model has excellent performance separating the BBB-penetrant compounds from the non-BBB-penetrant compounds in the PDR dataset.
Water Solubility Model Features
The experimental solubility values of 784 compounds were used to develop the model using genetic partial least squares regression. The model has R^{2} = 0.84 and a test set RMSE = 0.87. The model predicts the aqueous solubility of a compound at 25 °C and reports the predicted solubility and its solubility ranking relative to the solubilities of other drug molecules. The model has an RMSE = 1.0 for a large, diverse test set of 1615 compounds, including molecules from the PDR and the CMC.
Last updated June 13, 2001 at 03:27PM Pacific Daylight Time.
Copyright © 2001, Accelrys. All rights reserved.