Statistics Using R with Biological Examples
Kim Seefeld, MS, M.Ed.*
Ernst Linder, Ph.D.
University of New Hampshire, Durham, NH
Department of Mathematics & Statistics
*Also affiliated with the Dept. of Nephrology and the
Biostatistics Research Center, Tufts-NEMC, Boston, MA.
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use.
Preface
This book is a manifestation of my desire to teach researchers in biology a bit
more about statistics than an ordinary introductory course covers and to
introduce R as a tool for analyzing their data. My goal is to
reach those with little or no training in higher level statistics so that they can do
more of their own data analysis, communicate more with statisticians, and
appreciate the great potential statistics has to offer as a tool to answer biological
questions. This is necessary in light of the increasing use of higher level
statistics in biomedical research. I hope it accomplishes this mission and
encourage its free distribution and use as a course text or supplement.
I thank all the teachers, professors, and research colleagues who guided my own
learning, especially those in the statistics and biological research departments
at the University of Michigan, Michigan State University, Dartmouth Medical
School, and the University of New Hampshire. I thank the Churchill group at the
Jackson labs for inviting me to Bar Harbor while I was writing the original
manuscript of this book. I especially thank Ernst Linder for reviewing and
working with me on this manuscript, NHCTC for being a great place to teach,
and my current colleagues at Tufts-NEMC.
I dedicate this work to all my students, past, present and future, both those
that I teach in the classroom and the ones I am "teaching" through my writings.
I wish you success in your endeavors and encourage you never to quit your
quest for answers to the research questions that interest you most.
K Seefeld, May 2007
1
Overview
The coverage in this book is very different from a traditional introductory
statistics book or course (both authors have taught many such courses).
The goal of this book is to serve as a primer to higher level statistics for
researchers in biological fields. We chose topics to cover from current
bioinformatics literature and from available syllabi from the small but growing
number of courses titled something like "Statistics for Bioinformatics". Many
of the topics we have chosen (Markov Chains, multivariate analysis) are
considered advanced level topics, typically taught only to graduate level
students in statistics. We felt the need to lower the level at which these topics
are taught to accommodate interested readers without a statistical background. In
doing so we avoided complicated equations and mathematical language as much as
possible. As a cautionary note, we do not aim to replace a
graduate level background in statistics, but we do hope to convey a conceptual
understanding of these concepts and the ability to perform some basic data analysis
with them, as well as a better understanding of the vocabulary and concepts that
frequently appear in the bioinformatics literature. We anticipate that this will inspire further
interest in statistical study as well as make the reader a more educated consumer
of the bioinformatics literature, able to understand and analyze the statistical
techniques being used. This should also help open communication lines
between statisticians and researchers.
We (the authors) are both teachers who believe in learning by doing and feel
there would be little use in presenting statistical concepts without providing
examples using these concepts. In order to present applied examples, the
complexity of data analysis needed for bioinformatics requires a sophisticated
computer data analysis system. Contrary to a common misperception among
researchers, computer programming languages (such as Java or Perl) and
office applications (such as spreadsheets or database applications) cannot replace a
statistical applications package. The majority of functionality needed to perform
sophisticated data analysis is found only in specialized statistical software. We
feel very fortunate to be able to obtain the software application R for use in this
book. R has been in active, progressive development by a team of top-notch
statisticians for several years. It has matured into one of the best, if not the best,
sophisticated data analysis programs available. What is most amazing about R is
that it is completely free, making it wonderfully accessible to students and
researchers.
R consists of a base program, which provides the core functionality, and
smaller specialized add-on modules called packages that extend it.
One of the biggest growth areas in contributed
packages in recent years has come from bioinformatics researchers, who have
contributed packages for QTL and microarray analysis, among other
applications. Another big advantage is that because R is so flexible and
extensible, R can unify most (if not all) bioinformatics data analysis tasks in one
program with add-on packages. Rather than learn multiple tools, students and
researchers can use one consistent environment for many tasks. It is because of
the price of R, its extensibility, and the growing use of R in bioinformatics that R
was chosen as the software for this book.
The "disadvantage" of R is that there is a learning curve required to master its
use (however, this is the case with all statistical software). R is primarily a
command line environment and requires some minimal programming skills to
use. In the beginning of the book we cover enough ground to get one up and
running with R. We assume the primary interest of the reader is to be an
applied user of this software and focus on introducing relevant packages and
how to use the available existing functionality effectively. However, R is a fully
extensible system and as an open source project, users are welcome to contribute
code. In addition, R is designed to interface well with other technologies,
including other programming languages and database systems. Therefore R will
appeal to computer scientists interested in applying their skills to statistical data
analysis applications.
Now, let's present a conceptual overview of the organization of the book.
The Basics of R (Ch 2-5)
This section presents an orientation to using R. Chapter 2 introduces the R
system and provides guidelines for downloading R and obtaining and installing
packages. Chapter 3 introduces how to work with data in R, including how to
manipulate data, how to save and import/export datasets, and how to get help.
Chapter 4 covers the rudimentary programming skills required to successfully
work with R and understand the code examples given in coming chapters.
Chapter 5 covers basic exploratory data analysis and summary functionality and
outlines the features of R's graphics system.
Probability Theory and Modeling (Ch 6-9)
These chapters are probably the most "theoretical" in the book. They cover a lot
of basic background information on probability theory and modeling. Chapters
6-8 cover probability theory, univariate probability distributions, and
multivariate probability distributions, respectively. Although this material may seem more academic
than applied, this material is important background for understanding Markov
chains, which are a key application of statistics to bioinformatics as well as for a
lot of other sequence analysis applications. Chapter 9 introduces Bayesian data
analysis, which is a different theoretical perspective on probability that has vast
applications in bioinformatics.
Markov Chains (Ch 10-12)
Chapter 10 introduces the theory of Markov chains, which are a popular method
of modeling probability processes and are often used in biological sequence
analysis. Chapter 11 explains some popular algorithms (the Gibbs sampler and
the Metropolis-Hastings algorithm) that use Markov chains and appear
extensively in bioinformatics literature. BRugs is introduced in Chapter 12
using applied genetics examples.
Inferential Statistics (Ch 13-15)
The topics in these chapters are the topics covered in traditional introductory
statistics courses and should be familiar to most biological researchers.
Therefore the theory presented for these topics is relatively brief. Chapter 13
covers the basics of statistical sampling theory and sampling distributions, but
added to these basics is some coverage of bootstrapping, a popular inference
technique in bioinformatics. Chapter 14 covers hypothesis testing and includes
instructions on how to perform the most popular tests using R. Regression and ANOVA
are covered in Chapter 15 along with a brief introduction to general linear
models.
Advanced Topics (Ch 16-17)
Chapter 16 introduces techniques for working with multivariate datasets,
including clustering techniques. It is hoped that this book serves as a bridge to
enable biological researchers to understand the statistical techniques used in
these packages.