Integer Quadratic Programming Models in Computational Biology

[Pages:14]Integer Quadratic Programming Models in Computational Biology

Harvey J. Greenberg

Department of Mathematical Sciences University of Colorado at Denver and Health Sciences Center harvey.greenberg@cudenver.edu

1 Introduction

This presentation has two purposes: (1) show operations researchers how they can apply quadratic binary programming to current problems in molecular biology, and (2) show formulations of some combinatorial optimization problems as integer programs. The former purpose is primary, and I wish to persuade researchers to enter this exciting frontier. The latter purpose is part of a work in progress.

I begin with some background, particularly in the biology; mathematical programming terms can be found in the Mathematical Programming Glossary [10]. I then present four integer quadratic programming (IQP) models. My last section indicates some avenues for research.

2 Background

In this section I introduce a few things about the underlying biology, which are needed to get started. I then give the general form of the underlying model and discuss some of its attributes. In both cases, I elaborate in the next section in the context of each model.

2.1 Molecular Biology

All life depends upon three critical molecules: 1. DNA, which contains information about how a cell works 2. RNA, which makes proteins and performs other functions 3. Proteins, which are regarded as the workers of the cell.

DNA is a double-stranded sequence of nucleic acids: Adenine, Cytosine, Guanine, and Thymine. RNA is a single-stranded sequence of nucleic acids:

2

Harvey J. Greenberg

Adenine, Cytosine, Guanine, and Uracil. The genetic code maps each triple of nucleic acids into one of 20 amino acids. A peptide bond is a bonding of two amino acids. A protein is determined by a sequence of successively bonded amino acids. Figure 1 shows the bonding of two amino acids -- the carboxyl end of one bonds with the amino end of the other, discarding water. It then shows a generic backbone defined by a sequence of amino acids.

R1 O

NH2 CH C OH

R2 O

H N CH C

H

OH

water goes away

R1 O

NH2 CH C

R2 H

N CH

O C

OH

residues (or side chains)

R1 O

NH2 CH C

R2 O

N CH C

Rn

N ... CH COOH

H

H

amino end N terminus

carboxyl end C terminus

backbone

Fig. 1. Bonding Amino Acids

The central dogma is the information flow from DNA to proteins. (This use of "dogma" may seem strange -- see [5, p. 116] for an explanation.) Figure 2 illustrates this. I shall say more as needed, in the context of each problem.

2.2 Integer Quadratic Programming

The IQP model has the form:

opt xQx + dy : x {0, 1}, x X(y), y Y,

where y are continuous variables (which may be absent), restricted to the polyhedron, Y , and X(y) restricts the binary decision variable (x) to a polyhedron that depends upon y. (Think of this as some system like Ax + By b.)

One approach is to linearize the objective function by introducing auxiliary variables, w, replacing the quadratic form with cw. There are different ways to do this, and I refer you to Adams, et al. [1]. Another approach is to convexify the quadratic form by adding or subtracting diagonal terms from the fact that x2 = x for x {0, 1}. In particular, we can choose such that adding

IQP Models in Computational Biology

3

Fig. 2. The Central Dogma of Molecular Biology

j j(x2j - xj) to the objective renders the quadratic form positive definite. Just as there are variations in linearization, so there are algorithm design choices in the convexification [3, 8] (e.g., choosing ).

This presentation is about the models, not the algorithms to solve them. All of the models I present have the following properties.

? They are NP-complete. ? Some special cases have polynomial algorithms. ? Some have approximation algorithms. ? Some have been analyzed for deep cuts and pre-processing. ? Very few have been analyzed for linearization. ? None have been analyzed for convexification.

One may ask, "Why use a general IQP approach?" This must take more computer time and space to solve it, compared to a special-purpose algorithm for each particular problem. My answer is that we gain the following:

Understanding. Seeing the algebraic structure of one model offers a vantage for insights not necessarily obtained from a purely combinatorial vantage.

Extension. Often, we want to add constraints or otherwise extend the basic model. This is easy to do with an IQP approach, but it might require an entirely new algorithm design for a direct, problem-specific approach.

Management. It is easier to manage an IQP model, particularly with good management practices, such as separation of model specification from data that determines instances of the model.

Transfer. We can compare IQP models of different biology problems and gain insights into all of them that we might otherwise miss.

4

Harvey J. Greenberg

3 Models

In this section I present four IP models with quadratic objectives; other recent surveys with linear objectives can be found in [4, 7]

3.1 Multiple Sequence Alignment

Two fundamental biological sequences are taken from the alphabet of nucleic acids, {a,c,g,t}, or from the alphabet of amino acids, {A,R,N,D,C,Q,E,G, H,I,L,K,M,F,P,S,T,W,Y,V}. The former are segments of DNA (or RNA if t is replaced by u); the latter are segments of proteins.

The biology problem is to seek similarities among a given set of sequences from the same alphabet. This might be to:

? Understand life through evolution ? Identify families of proteins to infer structure or function from sequence ? Diagnose disease ? Retrieve similar sequences from databases.

The problem is called Multiple Sequence Alignment (MSA). An early application of MSA that illustrates its importance is given by

Riordan, et al. [14], who discovered the Cystic Fibrosis Transmembrane Regulator gene and its connection to Cystic Fibrosis. Even before then, algorithms for MSA [13, 15] were developed for the biologists who used computational methods for their research.

One key to defining an objective is the notion of a gap. This is a sequence of insertions or deletions (called indels) that occur during evolution. For example, if one applies a simple Hamming distance to the sequences, acacta and tacact, they are six characters apart. However, inserting gaps we obtain the extended sequences,

-acacta |||||

tacact-

This has a Hamming distance of only two. The middle sequence, acact, is the same in the alignment. We may suppose there was a common ancestor that inserted a at the end to produce the first sequence and deleted t at the beginning to obtain the second.

Two sequences can be optimally aligned by dynamic programming, where "optimal" is one that maximizes an objective that has two parts:

1. a scoring function, given in the form of an m ? m matrix S, where m is the size of the alphabet. The value of Sij measures a propensity for the ith alphabet-character in one sequence to align with the jth alphabetcharacter in some position of the other sequence. Example: Let s = agg and t = gac. In the direct alignment, the total score is Sag + Sga + Sgc.

IQP Models in Computational Biology

5

2. a gap penalty function, expressed in two parts: a "fixed cost" of beginning a gap, denoted Gopen, and a cost to "extend" the gap, denoted Gext. Example: Let s = agg and t = cgag. One alignment is to put a gap in the first sequence: ag-g cgag

Figure 3 shows four different alignments for the two nucleic acid sequences, agt and gtac. Suppose the scoring matrix is

acg t

2 -1 -2 0

S

=

-1

-2

2 0

0 2

-2 -1

.

0 -2 -1 2

Then, the scores (without gap penalties) are 4, 0, and -3, respectively.

agt-||

-gtac

-a-gt gtac-

agtgtac

Fig. 3. Three Alignments for Two Sequences

The total objective function for the 2-sequence alignment problem has the form

Ssitj - Gopen(Ns + Nt) - Gext(Ms + Mt),

i,j

where the sum is over aligned characters, si from sequence s with tj from sequence t. The number of gaps opened in sequence s is Ns, and it is Nt in sequence t. The number of gap characters (-) is Ms in sequence s and Mt in sequence t. In the example of fig. 3, if Gopen=2 and Gext=1, the gap penalties are 7, 9, and 3, respectively.

Subtracting the gap penalties from the scores, the objective values of the alignments are -3, -9, and -6, respectively. The first alignment is (uniquely) optimal.

There are different scoring methods, but this is the common one, which I use. One way to evaluate an MSA is by summing pairwise scores. Figure 4 shows an example. Using the same scoring matrix as above, the sum-of-pairs score is shown for each column. For example, column 1 has 3Saa + 3Sac = 3. The sum of pairwise scores for column 2 is zero because we do not score the gaps by columns; they are penalized for each sequence (row of alignment). The total objective value is 31 - 28 = 3.

The IQP model is as follows. Let

6

Harvey J. Greenberg

Gap penalty

a-gagt-act---

8

aagtat--at---

7

a--tataa----t

8

c-gta--actcct

5

score: 3066020606002

28 = Total

Total = 31

Fig. 4. A Multiple Alignment of Four Sequences

xik =

1 if ith character of sequence is assigned to column k ; 0 otherwise.

yk =

1 if sequence has a gap in column k ; 0 otherwise.

zk =

1 if sequence opens a gap in column k ; 0 otherwise.

Then, the IQP for the MSA of sequences s1, . . . , sm is given by:

max k p> subject to:

i,j Ssip sj xikxjpk -

k, Gopenzk + Gextyk

k xik = 1 i, k>k xi+1,k xik i < L, , k yk + i xik = 1 k, yk - yk-1, - zk 0 k,

x {0, 1}, 0 y, z 1.

I do not explicitly require y, z to be binary; that is implied by x binary for basic solutions. (An interior solution yields fractional values of z if Gopen=0.)

The first sum in the objective is the sum-of-pairs score, from which we subtract the total gap penalty. The first constraint requires that each character (i) in each sequence () is assigned to some column (k). The second constraint preserves the character order of each sequence -- if the ith character of string is assigned to column k (xik = 1), then its successor (i+1) must be assigned to a subsequent column (k > k). The third constraint requires that, for each column of each sequence, either a character is assigned or it is in a gap. Finally, the last constraint requires that a gap is opened (i.e., zk is forced to 1) if it changed from no gap assignment (yk-1 = 0) to a gap assignment (yk = 1).

3.2 Lattice Protein Folding

We are given the amino acid sequence of a protein, and we want to know its structure after it folds -- that is, twists and turns to establish atomic bonds.

IQP Models in Computational Biology

7

Structures include the alpha helix and beta sheet, and these determine (in some way) the function of the protein. There have been many approaches to this over the past three decades, and the literature is vast. Here I consider one model introduced by Hart and Istrail [9].

Each amino acid has several properties, and one of particular importance is its hydrophobicity. It is hydrophobic, denoted H, if it does not like water; it is hydrophilic, denoted P, if it does. Some classes of proteins, namely the globular ones, have a hydrophobic core, and that helps to predict some of the structure.

A lattice is a grid approximation to space, and we want to assign amino acids to points on the grid such that the total number of hydrophilic neighbors is a maximum. Figure 5 shows an example. The score is 2, coming from the bonding of amino acid residues 1-6 and 2-5.

Sequence: HHP P HHHP P 123456789

Fold:

987

456 3 21

Fig. 5. Example Lattice Fold

Hart and Istrail [9] laid the foundation for applying combinatorial optimization approximation algorithms. Their proof-technique for establishing their approximation algorithms has become a paradigm for this area.

The IQP model is as follows. Let

xip =

1 if acid i is assigned to point p ; 0 otherwise.

Let H denote the set of hydrophobic acids. Further, let N (p) denote the neighbors of point p (excluding p). For a square grid, one can use the Manhattan distance to define neighbors:

N (p) = {q : |Xp - Xq| + |Yp - Yq| = 1},

where (Xp, Yp) denotes the coordinates of point p. Then, the IQP for the Lattice Protein Folding Problem for a sequence of

n amino acids is:

8

Harvey J. Greenberg

max p qN (p) subject to:

i,jH:j>i+1 xipxjq

p xip

= 1 i

i xip

1 p

qN (p) xi+1,q xip p, i < n

qN (p) xi-1,q xip p, i > 1

x {0, 1}.

The objective scores a 1 when hydrophobic acids i and j are assigned to neighboring points p and q (xip = xjq = 1). The added condition j > i + 1 is to avoid double counting and exclude those that are already adjacent in the given sequence.

The first constraint requires that each acid be assigned to exactly one point; the second requires that at most one point is assigned to an acid. The last two constraints require backbone neighbors to remain neighbors in the fold -- that is, if acid i is assigned to point p (xip = 1), acid i ? 1 must be assigned to some neighbor (xi?1,q = 1 for some q N (p)).

We must add symmetry exclusion constraints [2]. Any rigid motion, like translation and rotation, appears as an alternative optimum since the variables have different values, but such groups are the same fold. For example, the fold in fig. 5 can be rotated, as shown in fig. 6. I rotated it 90o clockwise about the middle acid (#5). Number the points such that x11 = 1 in the original fold, so x13 = 1 in the rotation -- that is, even though the folds are the "same," the IQP solutions are different.

4

3

9

5

26

8

1

7

Fig. 6. Example Lattice Fold (fig. 5) Rotated 90o Clockwise

Extend the model to eliminate some symmetries at the outset with the following symmetry exclusion constraints:

Fix middle acid at mid-point of grid: xmpm = 1 Restrict acid 1 to the upper half of quadrant III: pQ x1p = 1. Fixing the middle acid prevents translation, and restricting acid 1 to the upper half of quadrant III, as indicated in fig. 7, prevents equivalent folds by reflection about the 45o line.

................
................

In order to avoid copyright disputes, this page is only a partial summary.

Google Online Preview   Download