Consensus Algorithm for Structure Prediction:
The ROAWBetter Algorithm and Bachalor Framework

Ross Bayer
Alex Chan
William Lu
Olga Russakovsky

Introduction

Protein structure prediction is an active area of research with many algorithms, none of which completely solves the problem. In recent years, consensus-based algorithms that combine the results of multiple algorithms have proven to perform best among all prediction algorithms. In this paper, we propose a novel approach to consensus-based prediction: we divide the space of all proteins into areas where a particular algorithm performs best and use that algorithm to predict structures in its area. The idea is more intuitive with an analogy.

Lazy Student Analogy

Suppose you are a lazy student preparing for a comprehensive exam. The exam covers several areas; in the example below, the areas are Literature, History, Math, and Science. Unfortunately, the questions on the exam are neither separated into sections nor labeled by area; however, each question is equally likely to come from any of the areas.

Since you are lazy, you don’t want to study all of these areas (not to mention that it may not be possible to study everything in the time before the exam). Fortunately, the exam is an open collaboration exam, meaning you may freely discuss any of the questions with the other exam takers. In addition, you know how well each person did on a previous comprehensive exam and his score in each area. A concrete example of this problem with four exam takers other than you is depicted in Figure 1. How would you collaborate to come up with the answers to the exam?

[pic]

Figure 1: A specific example of the Lazy Student Analogy. In the example, the comprehensive exam has four areas: Literature, History, Math, and Science. The letter grade of how well each exam taker did on a previous comprehensive exam is given.

The first solution would be to simply collaborate with the student who had the highest score on the last comprehensive exam. This makes sense: if the previous exam is a robust measure, that student should score equally well on the current exam, and as a result you would score as well as he does. Unfortunately, this may not be the highest score possible. In the above example, this solution corresponds to an overall score of C, if the previous result is an indication of the current exam.

The first solution has a few drawbacks. First, it doesn't take advantage of the fact that there are many exam takers, not just the one student with the highest score on the last exam. Second, it doesn't use the information given by the exam areas. On closer examination, we note that although several of the students didn't score the highest on the comprehensive exam, each has areas in which he scored extremely well. If we could take advantage of this fact, it would be possible to score higher than the C obtained above.

The second and better solution does just this: it takes advantage of how well each exam taker did in each area of the exam. Although an exam taker may have a lower overall score, he may do well in a particular area, and breaking the exam down into finer chunks makes this evident. For each question, determine the area to which it belongs and then collaborate with the exam taker who had the best score in that area; repeat this process for every question. In the above example, three of the students did well in Literature, History, and Math, respectively, even though their overall scores are lower than that of the highest-scoring student. Using this approach, we would get an A in Literature, an A in History, an A in Math, and a C in Science. The overall score is an A-/B+, much better than the overall C obtained with the first solution. This is the idea behind our consensus algorithm.
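The two strategies can be sketched in a few lines of Python. The students and grades below are illustrative stand-ins (Figure 1's exact values are not reproduced here); the point is only the per-area selection rule:

```python
# Hypothetical grades for four exam takers; stand-ins for Figure 1.
GRADE_POINTS = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0}

scores = {
    "Student1": {"Literature": "A", "History": "D", "Math": "D", "Science": "D"},
    "Student2": {"Literature": "D", "History": "A", "Math": "D", "Science": "D"},
    "Student3": {"Literature": "D", "History": "D", "Math": "A", "Science": "D"},
    "Student4": {"Literature": "C", "History": "C", "Math": "C", "Science": "C"},
}
AREAS = ["Literature", "History", "Math", "Science"]

def best_student(area):
    """Exam taker with the highest past grade in the given area."""
    return max(scores, key=lambda s: GRADE_POINTS[scores[s][area]])

def overall_gpa(strategy):
    """Average grade points over all areas under a collaboration strategy."""
    return sum(GRADE_POINTS[scores[strategy(a)][a]] for a in AREAS) / len(AREAS)

# Solution 1: always ask the best overall student (Student4, straight C's).
print(overall_gpa(lambda area: "Student4"))  # 2.0, i.e. a C
# Solution 2: ask the per-area specialist for each question.
print(overall_gpa(best_student))             # 3.5, i.e. roughly A-/B+
```

The gap between the two numbers is exactly the gain from exploiting per-area information.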

Although we are able to obtain a better score with the second solution, it is by no means easy to carry out. The difficulty lies in the fact that the questions are neither divided nor labeled by area. Therefore, how well we perform with this solution depends in no small part on how well we determine the area a question comes from. In fact, if we determine the area incorrectly, the actual score may be worse than that of the first solution.

Parallels to Protein Structure Prediction

The parallels between the Lazy Student Analogy and protein structure prediction are as follows. The task of predicting the structure of a protein is essentially a comprehensive exam, since any algorithm must take into account numerous factors that influence the protein structure, such as the energies involved in folding, the function of the protein, and the amino acid sequence. Each algorithm that tries to predict protein structure is an exam taker. Unfortunately, protein structure prediction algorithms vary in performance: every algorithm has a specialty, such as threading against a particular profile, computing certain physical energies, or sequence homology and database search. A typical plot of performance over the space of all proteins is depicted in Figure 2.

[pic]

Figure 2: A typical plot of performance over the space of all proteins for three algorithms.

Taking a cue from the Lazy Student Analogy, if we were able to break the space of all proteins into discrete areas corresponding to the places where a particular algorithm performs well, we can obtain a much better prediction than using any one algorithm alone. But how do we divide the space of all proteins?

Statistics on the number of new folds discovered in recent years show that it is relatively small, which suggests that the number of folds may approach a finite number asymptotically. If this is indeed the case, then it makes sense to divide the space of all proteins by protein fold, with each algorithm predicting proteins with particular folds. Using this theory and the assumption that the number of folds is constant, we have divided the space of all proteins using the families in SCOP. Again, just as in the analogy, this division is not trivial: protein classification is itself a hard problem in practice. The advantage of our approach, however, is that it combines solutions to both protein classification and protein structure prediction to obtain better predictions.

ROAWBetter Algorithm

Our project consisted of three basic parts: choosing the right algorithms to include in our consensus, determining which of those algorithms to run on which protein families, and deciding how to classify the unknown protein into the correct family. We will discuss each of these in detail below.

Servers

We spent quite a long time researching different algorithms and deciding which ones to use for our project. In the end, we chose algorithms based on several factors. The first was their performance in CASP and LiveBench. Obviously, we wanted algorithms that did a good job of predicting the structure of most proteins, so that when we combined them we could achieve even more impressive results. For example, Robetta, which beat all other fully automated algorithms by a long shot in CASP, was an easy pick.

The second factor was participation in both LiveBench and CASP. We wanted to be able to both train and test our program, and so we needed two different sets of data. We decided it would be best to choose algorithms that participated in both of these competitions so that we could gauge how well our consensus algorithm would have performed. However, if we were to expand this project further, this would not be a necessary restriction.

The third factor was the ability to communicate with the servers. Some algorithms (such as ORFeus-2 and SUPERFAMILY) performed very well, yet their websites were perpetually unavailable, and so we were forced to consider other servers instead.

After eliminating all servers that did not participate in both CASP and LiveBench, and those that we could not communicate with, we were left with a relatively small group. In the end, we settled on four algorithms, namely Robetta, FFAS03, SAM-T02, and SAM-T99, mostly based on their performance. We considered things such as performance on “easy” versus “hard” models in LiveBench, and we tried to include algorithms with a variety of approaches to structure prediction. For example, Robetta is an ab initio method, FFAS03 is based on a profile-profile comparison algorithm, and the SAM servers create multiple alignments, use the alignment to predict secondary structure, and then build an HMM used for searching the protein databank for similar proteins.

We used two versions of the SAM server because they both performed well, and we weren’t sure which one to choose. Ideally, however, we would probably not want to use two very similar servers because, chances are, they perform well on the same proteins. So to decrease the running time of our program, we would probably eliminate one of these servers from the final version. We would do so after extensive training to determine which of the servers performs better overall.

However, it is also a valid approach to leave both of them in, because having more servers means better results overall, even if the servers are similar. Even if they perform well on similar proteins, chances are one will not outperform the other on all proteins, and so there will still be some increase in the performance of our consensus algorithm if we include both servers.

Unfortunately, due to time constraints, we were unable to do extensive analysis of all the algorithms. In the future, it would be worthwhile to research the nature of each algorithm more deeply and to analyze which proteins it performs best on and why. It would also be a good idea to try our program with a variety of servers and to see empirically which combination works best.

Training

So now that we have discussed which algorithms we decided to include in our project, we can say a few words about how we determined which algorithm to run when.

We did so by creating a databank using data from LiveBench 6 and 8 (these being the only LiveBench competitions where all 4 of our algorithms were run). For each protein that was tested in the competition, we knew how well each of the 4 algorithms we chose performed when predicting its structure, and also which SCOP family (if any) the protein belongs to. So we ended up with some representative proteins for many families. Ideally, we would want to have many representative proteins for every family; however, we unfortunately did not have access to that much data. In the end, this is what our databank looked like:

[pic]

Figure 3: A schematic diagram of our databank. The bright blue proteins are those that we had data for. The families represent all families in SCOP – the yellow ones are those that contained at least one of the proteins from LiveBench, i.e. those families that we had some data for.

After reading in all the data, we considered all proteins belonging to a particular family and averaged the performance results for each algorithm on that family. Then given a new protein, we would classify it into the correct family as described below, and run the algorithm that had the highest average score for that family.
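As a sketch, the training step amounts to grouping records by SCOP family and averaging each algorithm's scores within the family. The family identifiers and scores below are invented for illustration; in ROAWBetter the records come from the LiveBench 6 and 8 data:

```python
from collections import defaultdict

# Illustrative training records: (SCOP family, algorithm, GDT score).
records = [
    ("a.1.1", "Robetta", 52.0), ("a.1.1", "FFAS03", 61.0),
    ("a.1.1", "Robetta", 48.0), ("b.2.5", "SAM-T02", 44.0),
    ("b.2.5", "Robetta", 39.0),
]

def train(records):
    """Average each algorithm's scores within each family."""
    by_family = defaultdict(lambda: defaultdict(list))
    for family, algo, score in records:
        by_family[family][algo].append(score)
    return {fam: {algo: sum(v) / len(v) for algo, v in algos.items()}
            for fam, algos in by_family.items()}

def pick_algorithm(model, family, default="Robetta"):
    """Best-average algorithm for the family; fall back to the default
    server when the family has no training data."""
    if family not in model:
        return default
    return max(model[family], key=model[family].get)

model = train(records)
print(pick_algorithm(model, "a.1.1"))  # FFAS03 (61.0 beats Robetta's 50.0 average)
print(pick_algorithm(model, "d.9.9"))  # Robetta (family unseen in training)
```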

Note that some families (actually, most to be honest) did not have any data, because there were only 256 proteins tested in LiveBench 6 and 8. In this case, we used the default algorithm, Robetta, on any protein that was classified into one of those families. We chose Robetta because it was the best performing algorithm out of the 4 we were using (this choosing of a “default” algorithm was done automatically by averaging results from all proteins in the data source).

Classifier

After deciding which algorithms to use on which family, the final step of our project was figuring out how to classify the new protein into the correct family.

A problem that we faced during classification is that classifying proteins into families is inherently a structure prediction problem. In SCOP, proteins are first classified into classes according to their folds (such as “all-alpha” proteins), and then from there classified further into families, based on even more intricate structure patterns. So in order to be able to correctly classify a protein into its family, we need to figure out its structure. Yet we want to use knowledge about the protein’s family to figure out the structure of this protein. This is a chicken-and-egg problem.

The solution we came up with is rather simple. Instead of trying to look at the potential structures of our unknown protein, we can compare its sequence with proteins of known structure. Thus, given a new protein, we use BLAST to compare it to all proteins in the PDB (Protein Data Bank) to determine which ones it is most similar to. We ran BLAST against the PDB because it is a database of protein structures, and thus provides an unbiased link between structure and sequence. We then used a weighted average of the SCOP families of the results to predict the family for this protein. Of course, this is not a completely rigorous method of classifying SCOP family, and the predicted family might be far from the actual family the protein belongs to. However, it worked reasonably well for our purposes: it correctly classified 11 out of the 12 proteins in CASP 5 that were in the SCOP database.
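A minimal sketch of this classifier, assuming the BLAST search and SCOP lookup have already been reduced to a list of (family, weight) pairs (the hit weights, e.g. bit scores, and family identifiers below are hypothetical):

```python
from collections import defaultdict

def classify_family(hits):
    """Weighted vote over the SCOP families of BLAST hits.

    `hits` is a list of (scop_family, weight) pairs -- one per PDB hit
    whose SCOP family is known, weighted by something like the BLAST
    bit score. Returns the family with the largest total weight, or
    None when there are no usable hits.
    """
    votes = defaultdict(float)
    for family, weight in hits:
        votes[family] += weight
    if not votes:
        return None  # caller falls back to the default server
    return max(votes, key=votes.get)

# Hypothetical hits: two weaker hits in one family vs. one strong hit.
hits = [("c.37.1", 180.0), ("c.37.1", 150.0), ("b.1.1", 400.0)]
print(classify_family(hits))  # b.1.1
print(classify_family([]))    # None
```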

Ultimately, it would be a good idea to iterate our algorithm and check that the protein was indeed classified into the correct family. A way of doing this is to first predict the family using the method described above, then to run the best algorithm for that family and predict the protein structure, and then use the structure to obtain a more accurate prediction of the family the protein belongs in and repeat the whole process again. This would be a good addition to our program in the future; however, we believe that for now, this approach works well enough.
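The proposed iteration could be sketched as follows, with all four components passed in as placeholder callables (the stub classifiers, server names, and sequence in the demo are hypothetical):

```python
def iterative_predict(sequence, classify_seq, classify_struct,
                      best_server, run_server, rounds=3):
    """Sketch of the proposed refinement loop: predict, re-classify
    from the predicted structure, and repeat until the family
    assignment stabilizes. All callables are placeholders for the
    real components (BLAST classifier, trained family table, servers).
    """
    family = classify_seq(sequence)          # initial sequence-based guess
    structure = None
    for _ in range(rounds):
        structure = run_server(best_server(family), sequence)
        new_family = classify_struct(structure)
        if new_family == family:             # assignment has stabilized
            break
        family = new_family
    return family, structure

# Toy stand-ins: the sequence classifier guesses "a.1.1", but the
# structure-based classifier settles on "b.1.1".
family, structure = iterative_predict(
    "MKV",
    classify_seq=lambda seq: "a.1.1",
    classify_struct=lambda s: "b.1.1",
    best_server=lambda fam: {"a.1.1": "Robetta", "b.1.1": "FFAS03"}[fam],
    run_server=lambda server, seq: (server, seq),
)
print(family)     # b.1.1
print(structure)  # ('FFAS03', 'MKV')
```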

One final note is that sometimes this approach failed to identify a family for the protein at all, because BLAST returned no results (given that the PDB database is not all encompassing), or results not in SCOP. This was rather rare, and in this case we used the same fix as above, in the case where the family did not have any training data: we ran the default server (Robetta).

Results

Overall, we believe we obtained fairly good results, especially taking into account the fact that we had very limited training data. We tested our algorithm using data from CASP 5 and 6, based on comparison of GDT (Global Distance Test) scores on best models. GDT is the percent of residues that fall under a certain distance cutoff, as determined by the evaluators from CASP.
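For reference, the GDT computation itself is simple; the sketch below implements the common GDT_TS variant, which averages the fractions of residues within 1, 2, 4, and 8 angstrom cutoffs (the per-residue distances are toy values, and the real evaluation also involves finding an optimal superposition, which is omitted here):

```python
def gdt_fraction(distances, cutoff):
    """Fraction of residues whose predicted-to-actual distance (in
    angstroms, after superposition) falls under the cutoff."""
    return sum(d <= cutoff for d in distances) / len(distances)

def gdt_ts(distances):
    """GDT_TS: average of the fractions at 1, 2, 4, and 8 A cutoffs,
    expressed as a percentage."""
    return 100 * sum(gdt_fraction(distances, c) for c in (1, 2, 4, 8)) / 4

# Toy per-residue distances for a 5-residue model.
distances = [0.5, 1.5, 3.0, 5.0, 9.0]
print(round(gdt_ts(distances), 2))  # 50.0 (fractions 0.2, 0.4, 0.6, 0.8)
```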

Note that the actual rank in CASP is not calculated from GDT scores alone, which explains the discrepancies in the table below. However, GDT scores are used in the calculation, and thus the two are highly correlated. Since we did not have the benefit of human experts to judge our predicted structures, we had to evaluate using an automated measure, namely GDT scores.

Using this system of judging, we would have come 7th in CASP 6, which included human-expert systems, and 1st in CAFASP 3, which means we beat all the fully automated servers. On CASP 6 proteins, our GDT score was 47.21. Our average for CASP 5 was even higher, at 47.69. Figure 4 shows the results of other algorithms in CASP 6.

|Algorithm           |Rank |Average GDT score of best models |
|Ginalski*           |1    |51.5099 |
|KOLINSKI-BUJNICKI*  |2    |49.7810 |
|BAKER*              |3    |49.1034 |
|Skolnick-Zhang*     |4    |49.0531 |
|GeneSilico-Group*   |5    |47.2468 |
|CHIMERA*            |6    |45.5583 |
|CBRC-3D*            |7    |46.9423 |
|SAM-T04-hand*       |8    |47.3330 |
|Jones-UCL*          |9    |45.9990 |
|FISCHER*            |10   |45.8653 |
|Sternberg*          |11   |44.6241 |
|SBC*                |12   |44.4561 |
|SBC-Pmodeller5*     |13   |44.6399 |
|BAKER-ROBETTA_04    |14   |46.5282 |
|MCon*               |15   |42.8894 |
|CAFASP-Consensus*   |16   |42.4321 |
|TOME*               |17   |44.8038 |
|ACE                 |18   |44.6786 |
|BAKER-ROBETTA       |19   |46.4150 |
|SAM-T02             |69   |38.2556 |
|SAM-T99             |102  |28.9695 |
|FFAS03              |106  |36.2487 |

Figure 4: Table showing the top servers of CASP6, along with the 4 servers we used in our consensus. Algorithms denoted by a * are human-expert predictions. The 6 algorithms highlighted in blue are those that beat ROAWBetter. The 4 algorithms in green are those we used.

It is clear from this table that our algorithm performed quite well. The main point is that we managed to beat Robetta (hence the name, ROAWBetter) even though it was our default server, and so, because of our lack of data, we were choosing it most of the time. Since our GDT score was nonetheless higher than Robetta's, we must have been choosing other algorithms over Robetta at the right times.

The following diagram is a comparison between the actual structure and the structures predicted by Robetta and by FFAS03 on a certain protein. Note that the structure from FFAS03 is much closer to the actual structure than the one predicted by Robetta. This was one of the proteins where our algorithm chose FFAS03 over Robetta, thus increasing our overall performance.

[pic]

Figure 5: Sample diagram of actual versus predicted structures of a protein.

So our results are already quite promising. We expect that with the chance to train on much more extensive data, we would achieve even better results.

Bachalor Framework

Perhaps more important in the long-term than the specific details of the particular algorithmic choices we made in ROAWBetter is the overall framework being proposed. In the simplest sense, that framework consists of four key elements:

o Choice of Servers

o Training

o Algorithm Determination

o Family Classifier

The algorithm for any one of these particular elements could be changed and still successfully plugged back into the framework. This means that the framework can be continually improved and amended to produce better and better results, as depicted in Figure 6.

[pic]

Figure 6: A schematic diagram of the “plug-and-play” nature of the Bachalor framework being proposed. Any one of these pieces can be changed and reinserted into the system. This allows for continual refinements and improvements.

Specifically, had we had more time, we would have liked to make several significant improvements ourselves. The biggest change would undoubtedly be to train on a much larger source of data. Ideally, one would run every protein in SCOP (or at least a large number, and certainly a decent number from each family) through each of the servers of interest, and then use some scoring system to judge the results. This would generate a huge amount of data, making our system much more robust and reducing its heavy dependence on the default algorithm (Robetta). Unfortunately, we could not do this in the limited time available, since the servers tend to limit how many requests can be made from any one domain. In addition, another likely improvement would be to look at normalized Z-scores rather than raw scores, so as to take into account algorithm variability on a particular protein.
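The Z-score idea is straightforward: normalize each server's raw score on a given protein by the mean and standard deviation of all servers' scores on that protein, so a server is credited for standing above the field rather than for the protein being easy. A minimal sketch, with invented raw GDT scores:

```python
import statistics

def z_scores(raw):
    """Z-score of each server's raw score on one protein: how far the
    server stands above (or below) the field on that protein."""
    values = list(raw.values())
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return {name: 0.0 for name in raw}  # all servers tied
    return {name: (v - mean) / stdev for name, v in raw.items()}

# Invented raw GDT scores of our four servers on one (easy) protein.
raw = {"Robetta": 80.0, "FFAS03": 85.0, "SAM-T02": 75.0, "SAM-T99": 80.0}
print(z_scores(raw))
```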

Furthermore, the general framework would work with any combination of servers. We certainly would have liked to use algorithms such as ORFeus-2, SUPERFAMILY, ShotGun, and 3D-Jury. Our framework would be expected only to improve by having more servers to pick from, especially ones (as above) which performed well.

A more complex way of determining which algorithm to use could also be implemented. We just picked the algorithm that did best (on average) for the particular family the new protein got classified into. However, another reasonable approach might be to actually run a sequence similarity test against all proteins in the databank for that particular family, and take a weighted average (with weights based on sequence similarity score) of algorithm performance to decide which algorithm to use overall. This has the advantage of perhaps allowing a new protein’s classification to be more fine-grained, by effectively being classified into a subset of the family, but it also loses the advantage of averaging which our initial system had. It would be interesting to see if this change would lead to improved results, but certainly testing could easily be done to see which performs better.
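A sketch of this alternative selection rule, assuming similarity weights have already been computed for each training protein in the family (all names, weights, and scores below are illustrative):

```python
def pick_by_similarity(neighbors, algorithms):
    """Choose an algorithm by a similarity-weighted average of its
    performance on the family's training proteins.

    `neighbors` is one (weight, {algorithm: score}) pair per training
    protein in the family, with weights from a sequence-similarity
    comparison against the new protein.
    """
    total_weight = sum(w for w, _ in neighbors)
    def weighted_avg(algo):
        return sum(w * perf[algo] for w, perf in neighbors) / total_weight
    return max(algorithms, key=weighted_avg)

# Illustrative: FFAS03 shines on the training protein most similar
# to the new one, while Robetta does better on a distant relative.
neighbors = [
    (0.9, {"Robetta": 40.0, "FFAS03": 55.0}),
    (0.1, {"Robetta": 60.0, "FFAS03": 30.0}),
]
print(pick_by_similarity(neighbors, ["Robetta", "FFAS03"]))  # FFAS03
```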

Lastly, the family classification method (which is quite critical) could certainly be improved upon. Our classifier seemed to actually work rather well. However, there is little theoretical reason for it to do so, so it might be nice to have a more theoretically sound means of predicting SCOP family.

It is also noteworthy that our system allows estimation of confidence in the accuracy of a prediction. Several sources of confidence (e.g., the amount of data collected for the classified family, classifier confidence based on BLAST E-values and number of hits, and the confidence estimate of the server itself) can be combined to form a good idea of how accurate a given prediction is likely to be.

With all these possible improvements, and indeed others that could be devised, the framework lends itself to continual revision and overall improvement. This leads us to believe that it is a very promising system: even with our simplistic algorithms and very sparse training data, the algorithm still showed promising results.

Conclusion

One future direction is dividing the problem of protein structure prediction itself into subproblems. To revisit the Lazy Student Analogy: if we knew there would be a comprehensive exam where collaboration is allowed, the logical thing to do would be to divide the areas covered by the exam among the exam takers. Each person would study a particular area, becoming extremely proficient in it, and during the exam we would consult that person as the expert in that area. This approach holds promise for protein structure prediction as well. Since it is possible to divide the space of all proteins by fold, one direction would be to build specialized prediction algorithms specifically targeted at certain folds. These algorithms would be able to take advantage of the local environments of specific folds to become more proficient at predicting those structures.
