Parallel Computing in a Biostatistical Context



A Parallel Computing System for R

By Peter Lazar and Dr David Schoenfeld

Department of Biostatistics,

Mass General Hospital, Boston MA

I. Abstract:

With the advent of high-throughput genomics and proteomics, biological and medical researchers face data processing problems of a magnitude not previously encountered. Parallel computing offers an attractive option for researchers seeking to apply statistical and analytic computation to large-volume biological data. We have developed a parallel computation system, Biopara. Written in R, it allows a user to easily dispatch an R computation job to multiple worker computers. We examine the challenges of designing an efficient parallel computation framework, including user interface; proper data passing and control; installation, overhead and user management issues; and fault tolerance and load balancing. We show how our parallel computation system, using R exclusively, addresses these challenges. A statistical bootstrap procedure is used to illustrate the system. Biopara is available on CRAN:

Keywords: Parallel Computation, R

Correspondence Author: Peter Lazar, plazar@ , 617-724-0309 (fax 617-724-5713), 50 Staniford Street, suite 560, Boston MA 02114

II. Introduction:

With the expansion and integration of vast electronic datasets into every facet of biological research, medicine and data analysis, today’s researcher has access to overwhelming amounts of information. Statistical processing of this data, or even of a subset of it, is a daunting task. Non-parametric methods such as simulation, bootstrap and jackknife are especially well suited to meeting these challenges, but such methods are computationally intensive and involve large numbers of independent calculations. One solution to these problems is parallel computation: dividing a problem into “n” smaller sub-problems and distributing them to a cluster of machines for a potential “n”-fold increase in speed.
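The "n"-fold figure is an upper bound, which a small calculation makes concrete. The sketch below is a minimal illustration, assuming independent tasks of equal cost and ignoring communication overhead; the helper name `chunk_sizes` and the numbers used are assumptions for this example, not part of Biopara.

```r
# Split `total` independent, equal-cost tasks into `n_workers` chunks whose
# sizes differ by at most one. (Hypothetical helper, for illustration only.)
chunk_sizes <- function(total, n_workers) {
  base  <- total %/% n_workers   # tasks every worker receives
  extra <- total %% n_workers    # leftover tasks, one each to the first workers
  base + as.integer(seq_len(n_workers) <= extra)
}

sizes <- chunk_sizes(90, 4)

# The job finishes when the most heavily loaded worker finishes, so the
# best-case speedup is total work divided by the largest chunk.
ideal_speedup <- sum(sizes) / max(sizes)
```

With 90 tasks on 4 workers the largest chunk holds 23 tasks, so the best-case speedup is slightly below 4 even before any overhead is counted.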

Unfortunately, not all applications are well suited to parallel and multiprocessor execution. In particular, the popular and powerful statistical package “R” is an inherently single-processor application. We did not feel that contemporary toolkits such as Paralize (see (3)), a parallel toolkit for Matlab, or SNOW (see (2)), a parallel toolkit for R, fit our needs. Neither system offered direct control over environments or individual workers, and neither provided fault tolerance or load balancing. In response, we developed a socket-based message passing system called Biopara. Our system allows R to carry out coarsely parallel computations on a cluster of heterogeneous machines without requiring the user to perform any special programming. Biopara is an extensive wrapper system that enables data transfer, job management and intercommunication between individual R processes. This paper addresses the issues that arise in designing any parallel system and describes how they were dealt with in designing Biopara. In the following sections, we discuss the primary challenges faced when designing a multi-user parallel system for data analysis.

III. Design Issues in Parallel Computation

A. User Interface:

We assume that the user is running R on a desktop computer and that the cluster exists on a network where the Biopara system is installed. The user interface consists of a single function call with two parameters that define the job and two minor housekeeping parameters. A user call to Biopara is called a job, and the individual programs executed on each computer are referred to as tasks. One of the two job parameters is a text string that contains a single R expression or a list of R expressions. In the case of a single expression, the expression is executed the number of times specified by the second parameter, and the result of each execution is returned as an object in a list that is the output of the function call. In the case of a list of expressions, the count parameter is ignored; each expression is executed separately and the result of each execution is returned. In either case, parallel processing is used to evaluate the expressions, with the degree of simultaneity depending on the number of nodes in the cluster.
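The two calling modes can be mimicked locally to make the semantics concrete. The following is only a sketch under stated assumptions: `run_job_sketch` and its argument names are hypothetical, it is not Biopara's function or signature, and it evaluates tasks one after another on the local machine rather than dispatching them to a cluster.

```r
# Hypothetical local stand-in for the interface described above.
# This is NOT Biopara's API; it only mirrors the described behaviour.
run_job_sketch <- function(expr, count = 1) {
  if (is.list(expr)) {
    # A list of expressions: the count is ignored, one task per expression.
    tasks <- expr
  } else {
    # A single expression: the same task is repeated `count` times.
    tasks <- rep(list(expr), count)
  }
  # Evaluate each task independently. A real system would send these
  # tasks to worker machines instead of looping locally.
  lapply(tasks, function(txt) eval(parse(text = txt), envir = new.env()))
}

# Single expression, executed three times: a list of three results.
single <- run_job_sketch("mean(c(1, 2, 3))", count = 3)

# List of expressions: count is ignored, one result per expression.
multi <- run_job_sketch(list("1 + 1", "sqrt(16)"), count = 99)
```

Because every task is evaluated in isolation and the results come back as a plain list, the caller does not need to know how many machines took part in the job.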

For instance, to perform a bootstrap calculation, the user would specify the number of bootstrap replicates to be performed and a string defining the R function that performs a single bootstrap calculation. The following would call biopara to execute usrfn 90 times, dividing the executions automatically across the available machines in the cluster:

output ................
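As a rough local illustration of what such a job computes, the sketch below defines a hypothetical `usrfn` that performs one bootstrap replicate and evaluates it 90 times in a plain loop. It does not use Biopara's actual call signature, and the data `x` and the choice of the median as the statistic are assumptions made for this example.

```r
set.seed(1)
x <- rnorm(50)  # illustrative data, not taken from the paper

# Hypothetical usrfn: one bootstrap replicate of the sample median,
# computed on a resample of x drawn with replacement.
usrfn <- function() median(sample(x, replace = TRUE))

# Local, sequential stand-in for running usrfn 90 times; on a cluster
# these 90 independent evaluations would be shared among the workers.
output <- lapply(seq_len(90), function(i) usrfn())

# The list of replicates is then summarised, e.g. as a bootstrap
# estimate of the standard error of the median.
boot_se <- sd(unlist(output))
```

Each replicate depends only on `x` and its own random resample, which is what makes the bootstrap a natural fit for coarse-grained parallel execution.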