Try it: Genetic Drift in Excel - West Virginia University

[Pages:7]Try it: Genetic Drift in Excel

In this class you will gain plenty of experience running simulation programs written by others. While you may not want to bother with the nuances of programming languages, you will come to find that Microsoft Excel provides much of the power of a full programming language with a much shallower learning curve. Learning to use it well, or really with any degree of proficiency, will make your academic life much easier.

Excel: Pretty much these guys with less marketing.

In the lab we have performed Monte Carlo simulations of genetic drift using the program Populus. However, this is actually a very simple task to perform and can be done in Excel quite easily. Furthermore, the flexibility of Excel allows you to easily generate many summary statistics that do not come standard with the simulation programs we use in the lab. Essentially, you can have your cake, eat it, and then bake a much more awesome cake. Below, I show you how to perform a Monte Carlo simulation of genetic drift (with and without selection) in Excel from scratch. Try it, and then play around with it. See what happens! Genetic Drift Without Selection

Genetic drift, at its most fundamental, is simply a sampling problem. For each finite population, I sample alleles each generation, and the frequency of each allele will change from generation to generation because I won't always pick the same number of each allele. Thus, over time the allele frequency will drift, either to fixation or loss. Because I am not worried about selection yet, I only need two parameters to start my simulation: the size of each population and the starting frequency of an allele for a 2-allele locus ? I use p. I enter these in the top left corner of a spreadsheet; in this case I use N = 10 and p = 0.1, but I could pick any numbers I wanted.

I want to simulate how allele frequencies will change over time simply due to the effects of genetic drift, and I want to do this for a bunch of populations so that I can calculate some summary statistics such as the probability of fixation and the average time to fixation of my allele of interest. To begin with, I set up a population in each column, and I assume that they each have the starting value of p at generation 0. I set up about 150 populations on my spreadsheet, but you can add as many as you like. In fact, you should see how the error of your summary statistics changes as you add more populations. Here is a sample of my first two rows.

I should mention that I did not just enter the starting value of p into each of those cells. I actually like to do as little work as possible, so I entered the following into cell E2 and then dragged it across to each of my populations:

This references my starting value of p that I entered in the top-left corner. This will allow me to change this starting value and have each population automatically take this into account. Note my use of the "$" signs in front of the letter and the number. This allows me to drag the formula without the letter and number changing.

Now for the fun part! Remember that alleles are drawn from a binomial distribution. Therefore, in order to simulate the change in alleles due to chance, I need a way to simulate draws from a binomial distribution consisting of 2N trials (in my example 20) and with a probability of success equal to p in the previous generation. This is where the "Monte Carlo" part comes in (Monte Carlo being a famous casino). I enter the following formula into cell E3 and drag across to all my populations and then drag down for a large number of generations (enough to where everything reaches fixation or loss):

The breakdown: 1. $B$1 refers to the number of alleles in the population (number of trials). This is multiplied by 2,

because we have 2N trials for a diploid organism.

2. E2 refers to the value of p for this population in the previous generation (no "$" signs because I always want this to refer to p in the previous generation of a given population no matter where I drag it.

3. CRITBINOM(2N,p,RAND()) simulates a random draw from a binomial probability distribution with 2N trials and a probability of success p. CRITBINOM is a function that finds the fewest number of successes that exceed a certain cumulative probability threshold for this binomial distribution (if you're really interested in what that means, please come see me). In this case that threshold is made with a random number (from a uniform distribution) generated by the RAND function.

4. Because CRITBINOM(2N,p,RAND()) simulates the number of alleles drawn from the previous generation, I divide by the total number of alleles (2N) in order to get the value of p in the following generation.

5. The IFERROR function isn't entirely necessary, but if it is not included then the function generates an error after the population reaches fixation or loss. This function simply checks if the CRITBINOM function has generated an error (due to fixation or loss) and, if so, outputs the value of p in the previous generation, which will either be 0 or 1.

After doing this, my first few rows look something like the following. Note that your numbers will be somewhat different, as all values are dependent upon random numbers.

Notice that some populations achieve loss of the allele after only a few generations. Others are headed toward fixation and do eventually get there. Now that I have a bunch of simulated populations, I would really like to get some summary statistics. I'll show you how to obtain an estimate for the probability of fixation as well as the average time to fixation.

Obtaining an estimate for the probability of fixation is actually very easy. I simply need a count of all the populations where the frequency of my allele reached 1 (fixation), and then I will divide that by the total number of populations. Here is the formula I entered into cell B4 (you could use another cell):

The breakdown 1. COUNTIF counts the number of cells in a range where a certain condition is met. In this case I'm

interested in the last row of the simulation (shown below):

Note that all the populations have either gone to fixation or loss. COUNTIF is then simply counting all the populations where the cell value equals 1 (fixation). 2. COUNT is counting all the nonempty cells in the range. I divide the two in order to get the proportion where fixation has occurred, which will serve as my estimate for the probability of fixation. Obtaining the average time until fixation is almost as simple. I only need to get the time until fixation (if it occurs) for each population and then average them together. I obtain the time until fixation for each population using the following formula in the row after the last row in each drift simulation:

The breakdown 1. Everything is set within an IF function, which has the following format: IF(condition, value if

condition true, value if condition false). The condition in this case is whether the cell above it ? the last generation in the simulation ? has reached a frequency of 1 (fixation). 2. If fixation has been achieved, then the COUNTIF function counts all the rows (generations) where the frequency is less than 1 (" ................
................

In order to avoid copyright disputes, this page is only a partial summary.

Google Online Preview   Download