Research Statement - Yale University

John W. Emerson, Associate Professor, Department of Statistics, Yale University

I have two broad areas of interest: computational statistics and data analysis. In each area, my work will continue to strive for a balance between excellence in the practice of statistics on important real-world problems and opportunities to innovate. The use of statistics is expanding rapidly inside and outside of academia, with much of the growth in computationally intensive areas using large data sets from novel sources. We must respond to this growth by developing effective and practical statistical tools and by developing new methods of teaching in and out of the classroom. This is an emerging type of scholarship that will become increasingly important. I'm excited to be able to lead such efforts, which extend beyond the classroom, beyond the Department, and beyond Yale.

Computational Statistics

Much of my research over the past four years has been in an area closely aligned with computer science, where most statisticians lack the necessary expertise to contribute to the field. This work has integrated research concepts in computer science with the practice of statistics and with the process of applied statistical research on data sets growing in size and complexity. It was awarded the 2010 John M. Chambers Statistical Software Award, given by the American Statistical Association Section on Statistical Computing, and was the dissertation topic for my second PhD student, Michael Kane. However, this project did not result from an abstract realization of the limitations of statistical computing circa 2007, but rather from an effort to study the well-known Netflix Prize competition data. Years later, our solution provides a roadmap for scalable computing with massive data. We provide extensions to the R language to support computing with massive data, and describe how a new statistical computing environment (or a major update of the R environment) could benefit from incorporating our solution into the native memory allocation scheme. Such a project requires ongoing nurturing as it evolves in response to inquiries and suggestions from researchers around the world. As a result, it remains a fertile area for research: it is immediately useful with challenging new problems in the practice of statistics and is central to our exploration of future paths for statistical computing. I anticipate continued collaborations with Mike in these areas.

I've also developed interests in graphical methods. My early work on mosaic plots (graphical displays for categorical data where areas are proportional to counts in contingency tables) led to the creation of the generalized pairs plot. This innovative graphical display recognizes that data sets often contain both quantitative and categorical data, where traditional scatterplot matrices fail to provide effective displays. Such work follows in the tradition of exploratory data analysis generally attributed to John Tukey. A paper describing this methodology is nearing completion with Walton Green (Yale GRD '07, Geology and Geophysics) and other collaborators who have developed new implementations based on my original design. The generalized pairs plot (gpairs) and other new graphical techniques I have developed are available in an R extension package, and I anticipate continuing to contribute to the literature and to influence the future practice of statistics through such innovations.

My additional research interests in computational statistics involve more traditional methodological development. First, I'm working to extend techniques for Bayesian change point analysis to multivariate data and to real-time analysis of data, building on work done with my first PhD student, Chandra Erdman. I recognize a need for such methodology for research problems in neuroscience, for example, where a group of researchers at Yale has promising applications for this type of analysis. I have also worked on computational methods for processing streaming data and on graphical techniques for displaying real-time data. These are promising areas for future research, with an increasing number of problems involving live data streams and demanding meaningful statistical analysis under severe time and computing constraints.

Data Analysis

In the broad area of data analysis, I have two primary research focuses in which I have established a record of leading research. First, I've been influential in the wide range of studies of environmental performance housed at the Yale Center for Environmental Law and Policy (YCELP) and in collaboration with the Center for International Earth Science Information Network (CIESIN, based at Columbia University). I have also made broader contributions to the field, leading the call for full transparency of data and methods in such research projects. Specifically, I pushed for making the complete data sets freely available and inviting other researchers and government officials to propose alternate methods or contribute to improving the quality of the data. The resulting international attention encouraged many countries (notably Singapore and Korea) to improve data quality and institute policies to improve their environmental performance. Work on the 2012 Environmental Performance Index (EPI) is already underway and will no longer be considered a pilot project; its release in public policy circles will be accompanied by submissions of one or more publications in refereed journals.

Second, I have developed a series of papers studying various aspects of Olympic scoring systems. The first uncovered a serious problem with the scoring system for international figure skating competitions introduced following the judging scandal of the 2002 Olympic Winter Games. The second conducted an innovative analysis of biases evident in scores of an Olympic diving competition; here, standard analysis of variance techniques were unable to properly estimate the model parameters because of some unusual constraints in combination with data limitations. A third presents a simpler analysis of the diving scores at a level accessible to high school students and teachers. Most recently, changes in 2010 to the reporting of figure skating scores produced a new problem for statistical analysis; two resulting papers are in submission, one describing the methodology used to analyze the problem and the other describing the implementation of unusual goodness-of-fit tests required by the analysis. These papers demonstrate my commitment to excellence in the practice of statistics on real-world problems. They also provide examples of the types of problems I will continue to search for.

The Statistical Clinic and Other Collaborations

I created the weekly Statistical Clinic as a service to the broad community of researchers at Yale and to expose our graduate students to a wide range of research topics, data sets, and statistical methods. Historically, more statistics courses have been taught outside the department than by the department at both the graduate and undergraduate levels. However, there is expanding demand for expert statistical advice that is not met by course offerings, other forms of collaboration among faculty and researchers, or mentoring of students by faculty. The popularity of the clinics, often booked weeks in advance, reflects this need. Many of the topics require the statistical expertise I can offer together with some of our more experienced PhD students; most Master's students lack the experience to contribute actively and instead work to strengthen their skills by completing reports after each clinic. John Hartigan, Professor Emeritus of Statistics, and Peter Peduzzi, Director of the Center for Analytical Sciences, are regular attendees at the clinics.

I have a large number of stand-alone projects and collaborations, many resulting from visits to our Statistical Clinics. Every year a few of these clinics lead to collaborative research opportunities. Many of the publications listed in the Other Publications section of my Curriculum Vitae resulted from such collaborations. I also have ongoing collaborations with Stephen Stearns (Ecology and Evolutionary Biology), Paul Anastas (Department of Chemistry), Russel Barbour and Robert Heimer (Center for Interdisciplinary Research on AIDS), Zarrar Shehzad (graduate student, Department of Psychology), Mark Laubach (Neurobiology), and Mark Pagani (Department of Geology and Geophysics). I look forward to continuing with these collaborations, many of which will produce substantial contributions to the literature and to statistical practice in the coming years.
