Data Sets for Use in Statistics, Measurement and Design Courses



Charles Stegman, Calli Holaway-Johnson, Sean Mulvenon, Sarah McKenzie, Ronna Turner, and Karen Morton

University of Arkansas

Paper presented at the Joint Statistical Meetings of the American Statistical Association, International Biometric Society, Institute of Mathematical Statistics, and Statistical Society of Canada

Seattle, Washington

August 2006

Abstract

A major focus in teaching graduate-level courses in statistics, measurement, and design should be the analysis of data. Results can be used to illustrate key concepts underlying the procedures discussed, help students learn how to analyze theoretical data in preparation for their careers, aid in interpreting and presenting research results, and contribute to preparing future researchers. This paper presents information on a multitude of datasets applicable to teaching courses at multiple levels; the accompanying CD contains the actual datasets.

Background

It is common for textbooks in statistics and research methodology to include a disk with several datasets that are used throughout the text. Glass and Hopkins (1996) is a good example, although others could be mentioned. Textbook datasets are commonly limited in terms of the number of datasets included and the number of cases within each dataset.

The CD produced for this paper contains over 100 datasets from multiple fields, as well as Monte Carlo computer generated datasets. In addition, the datasets can be used across a range of courses from the introduction to research methodology and statistics through regression, ANOVA, multivariate, and advanced measurement.

Development of the CD

A first step was to locate publicly accessible datasets available on the web. These are datasets that can be downloaded and used in teaching so long as appropriate acknowledgement is given. For example, many researchers and professors have made their datasets available for public use through the StatLib library at Carnegie-Mellon University []. Four other helpful sites are the National Institute of Standards & Technology website [itl.div898/strd/general/dataarchive.html], the UCLA Statistics Lab website [ats.ucla.edu/stat], the Journal of Statistics Education Data Archive (publications/jse/jse_data_archive.html), and the DataFerrett []. The NIST site contains datasets that can be used to test or demonstrate the accuracy and precision of different computer packages when analyzing statistical data. The UCLA site contains a wealth of statistical information and sample programs. The JSE Data Archive contains datasets that have been submitted by researchers around the world, and includes articles utilizing the datasets where available. The DataFerrett allows you to search multiple topics through data mining technology and select variables for different analyses.

For the CD, selected datasets have been collected from these sites, with each dataset reviewed and included because it relates to topics regularly used as examples in statistics and research methodology courses. The datasets represent data from many fields of study, as do the examples in many of the textbooks. While professors and students can access any of these public domain datasets, the advantage of collecting them on a CD is that they are put into a standard format (Excel) and made readily available for uploading into numerous statistical packages. This should facilitate their use by multiple users in a variety of courses. Each dataset includes variable descriptions as well as the bibliographic information from the original source.

Additionally, samples from large scale datasets based on government sponsored research have been generated to support substantive educational research examples. For example, census data and other government sponsored large scale research have produced datasets such as the Early Childhood Longitudinal Study (ECLS-K), the National Longitudinal Survey of Youth (NLSY), the National Household Education Survey (NHES), and the National Education Longitudinal Study (NELS). DataFerrett can also be used to access large scale databases. The following are some of the topics that are available from DataFerrett: Health Care, Child School Enrollment, Computer Ownership & Uses, Voting & Registration, Race & Ethnicity, School Enrollment, Teenage Attitudes & Practices, and Library Use. Note that the DataFerrett allows you to search these and many more topics and select the variable sets you want.

A third area where datasets have been generated is through Monte Carlo procedures. By specifying population parameters, we generated datasets that reflect educational settings and illustrate important statistical properties. Multivariate data are also generated that can be used in a number of ways. For instance, variables can be selected for analysis in introductory courses and then revisited in more advanced courses like regression, design and multivariate statistics.

The Structure of the CD

Table 1 contains a list of the datasets contained on the CD. The title of each dataset is provided, as well as its name on the CD. The sample size and variables are also included. Finally, the original source for the data is given.

Insert Table 1

The datasets have been reformatted into Excel files. Many of the original files were in different formats and, while statisticians are adept at handling these, many students may still be learning basic data management. Especially in introductory classes, the emphasis is on data analyses using programs like SAS, SPSS, or R. Having the Excel files allows instructors to write one set of instructions for importing data, leaving more time to concentrate on statistical analyses. The exception is the large scale datasets from the national databases, which would be applicable to more advanced classes. Given the size of these datasets and the need for the weighting factors, Excel was too limiting. In this case, dBase and SAS data files were created.

In more advanced classes, students could be expected to find, import, and clean data from the original sources. They could then analyze both their own version and the CD version of the data to make sure they get the same answers.

Example of Using Some of the Datasets

The dataset (Arkansas Math.xls) is based on simulated student data for grades 3-5 on the Arkansas Benchmark Mathematics Examination. The Arkansas Benchmark is a criterion-referenced examination that consists of both multiple-choice and open-response questions. Tests for each grade level are developed to reflect content identified in the Arkansas state frameworks. The multiple-choice and open-response sections are weighted equally in determining a student’s score. In addition to their reported scaled scores, students are categorized as Below Basic, Basic, Proficient, or Advanced. Students with scaled scores of 200 or above are considered to be proficient, and those with scores above 250 are considered to be advanced. The dataset contains 216 observations on 19 variables that would be available to school personnel. The observations were generated to reflect the actual variables used by the State of Arkansas for No Child Left Behind (NCLB) school assessments.

Some of the ways we have used the Arkansas Math dataset include the following: the scaled scores can be used to demonstrate graphs (frequency distribution, frequency polygon, box plot and stem and leaf), measures of central tendency, variability, skewness, kurtosis and normality. Similarly, we have used the grade, gender and teacher variables to create subgroups for the same type of analyses. Several of the categorical variables are analyzed as well (demographics, crosstabs, and percentages). This is the material in the first five or six chapters in the introductory course. Students are required to create tables and figures using APA formats to help them in writing reports or articles.

The Arkansas Math dataset is also used to demonstrate a multitude of different statistical inferential procedures. You can select data for t-tests, ANOVA (one-way and factorials), model assumptions, multiple comparisons, effect sizes, correlation, regression, and chi-square analyses. The multiple choice and open response scores as well as the strand scores reflect multivariate data.

Another generated education dataset is Literacy Test.xls. This dataset was created to reflect data that would be available on many state criterion-referenced tests that are given at different grade levels. It differs from the previous example in two important ways. First, it is a larger dataset (5000 observations) and second, it includes individual student item scores tied to three strands that might be typical on a Literacy examination. The strands in this example are content, literacy, and practical. Each strand has 8 multiple-choice items (worth 2 points each) and an open-response item worth 16 points. Students receive a scaled score based on the points earned on the literacy items plus their response to a writing prompt. Other variables include gender, race, and free and reduced lunch participation. The same type of analyses mentioned above can be demonstrated with the dataset, but by having item data, a number of advanced measurement issues can also be discussed.

A third example involves the two datasets based on the binomial distribution (Random Guessing.xls, 80% Mastery.xls). These datasets involve expected performance of 50 students on examinations worth 40 points. The first set assumes guessing and the second set involves “mastery learning.” Note that instructors could actually conduct a class exercise and create the first dataset by giving students answer sheets to fill out without giving them the questions. The instructor could have the students “score” their tests with a pre-assigned answer key. The instructor could also discuss why some national tests involve a correction for guessing. Simple SAS “proc univariate” analyses show that the first distribution (probability of a correct response p = 0.2) is positively skewed, while the second (p = 0.8) is negatively skewed. Students could then practice merging the datasets and demonstrate a bi-modal distribution.
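The two binomial datasets can be mimicked with a few lines of code. The following Python sketch is an illustration only, not the procedure used to build the files on the CD: it assumes each of the 40 points is an independent item answered correctly with probability p, and the seed and function names are arbitrary choices. It also reports the theoretical binomial skewness, (1 - 2p) / sqrt(np(1 - p)), because with only 50 students the sample skewness can differ noticeably from the theoretical value.

```python
import random
import statistics

def simulate_scores(n_students, n_items, p_correct, rng):
    """Number-correct scores; each item is an independent Bernoulli(p_correct) trial."""
    return [sum(rng.random() < p_correct for _ in range(n_items))
            for _ in range(n_students)]

def binomial_skewness(n_items, p_correct):
    """Theoretical skewness of a binomial(n_items, p_correct) distribution."""
    return (1 - 2 * p_correct) / (n_items * p_correct * (1 - p_correct)) ** 0.5

def sample_skewness(values):
    """Moment-based sample skewness: m3 / m2**1.5."""
    mean = statistics.fmean(values)
    m2 = statistics.fmean([(v - mean) ** 2 for v in values])
    m3 = statistics.fmean([(v - mean) ** 3 for v in values])
    return m3 / m2 ** 1.5

rng = random.Random(2006)                       # arbitrary seed, for reproducibility
guessing = simulate_scores(50, 40, 0.2, rng)    # "random guessing" class
mastery = simulate_scores(50, 40, 0.8, rng)     # "80% mastery" class
merged = guessing + mastery                     # combined file, expected to be bimodal

for label, scores, p in (("guessing", guessing, 0.2), ("mastery", mastery, 0.8)):
    print(label,
          "mean =", round(statistics.fmean(scores), 2),
          "sample skewness =", round(sample_skewness(scores), 3),
          "theoretical skewness =", round(binomial_skewness(40, p), 3))
print("merged n =", len(merged))
```

A histogram of `merged` should show two separate clusters of scores, one near 8 points and one near 32 points.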

A fourth example (Star.xls) is based on student data (sample size is 150) for the STAR Reading and STAR Math tests given during the first quarter of the school year and the SAT-9 (reading, literacy, and math) given in the spring. Student gender is also included so that there are six variables for each student. Instructors can use the data for descriptive statistical purposes as well as correlation and regression analyses (including the correlation matrix, multiple regression, and testing for bivariate normality). Note that an instructor could also run simple procedures on the total data set, conduct separate analyses for each gender, and demonstrate tests for equality of correlations, tests for parallelism of regression lines, ANCOVA, and MANOVA. One use of such data might be to identify “at-risk” students and discuss potential interventions that might be used between October and May.

The Diamond Pricing datasets provide an example of how different analyses may require reformatting of the datasets. With the Diamond Pricing.xls dataset, students may conduct univariate analyses. With the Diamond Pricing With Dummy Variables.xls dataset, students can perform more complicated analyses such as multiple regression. One valuable exercise might be to have students begin with the basic dataset and create the Diamond Pricing With Dummy Variables.xls dataset themselves by using a statistical package such as SAS, SPSS or R.
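The dummy-coding exercise can be sketched in a few lines. The Python example below is a hypothetical illustration: the column names (`carat`, `color`, `price`) and the three rows are invented for the sketch and are not taken from the Diamond Pricing files. One level of the categorical variable is dropped as the reference category so that the indicator columns are not perfectly collinear with a regression intercept.

```python
def add_dummy_variables(rows, column, reference=None):
    """Return copies of the rows with 0/1 indicator columns for `column`.

    The reference level (by default the first level in sorted order) gets no
    indicator column; it is represented by all indicators being 0.
    """
    levels = sorted({row[column] for row in rows})
    if reference is None:
        reference = levels[0]
    if reference not in levels:
        raise ValueError(f"reference level {reference!r} not found in {column!r}")
    coded = []
    for row in rows:
        new_row = dict(row)  # copy, so the original rows are left unchanged
        for level in levels:
            if level != reference:
                new_row[f"{column}_{level}"] = int(row[column] == level)
        coded.append(new_row)
    return coded

# Invented rows for illustration only (not values from the CD).
rows = [
    {"carat": 0.30, "color": "D", "price": 1000},
    {"carat": 0.50, "color": "E", "price": 2000},
    {"carat": 0.70, "color": "F", "price": 3000},
]

coded = add_dummy_variables(rows, "color")
for row in coded:
    print(row)
```

With three color levels, two indicator columns (`color_E`, `color_F`) are created, and a row with color D is coded 0 on both.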

Certain datasets allow for instructors to demonstrate various statistical concepts. For example, the Birth To Ten datasets are actual data that illustrate Simpson's paradox. The Baby Boom.xls dataset allows us to examine a variety of distributions, including binomial, Poisson, and exponential. These types of datasets can assist students in transitioning from a theoretical understanding to pragmatic application.
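For instructors who want to show the mechanics of Simpson's paradox before turning to the real data, a small numeric sketch can help. The counts below are hypothetical and were invented for this illustration; they are not the Birth to Ten counts, and the subgroup and condition labels are placeholders. Within each subgroup the rate is higher when the condition is absent, yet the pooled rates point the other way because the subgroups differ greatly in size and in baseline rate.

```python
# Hypothetical counts, invented for illustration only (NOT the Birth to Ten data).
# Each entry maps (subgroup, condition) to (number with the outcome, group size).
counts = {
    ("Subgroup A", "condition present"): (80, 100),
    ("Subgroup A", "condition absent"): (18, 20),
    ("Subgroup B", "condition present"): (4, 20),
    ("Subgroup B", "condition absent"): (30, 100),
}

def subgroup_rate(counts, subgroup, condition):
    """Outcome rate within one subgroup for one condition."""
    successes, total = counts[(subgroup, condition)]
    return successes / total

def pooled_rate(counts, condition):
    """Outcome rate for one condition after collapsing over subgroups."""
    successes = sum(s for (_, c), (s, _) in counts.items() if c == condition)
    total = sum(n for (_, c), (_, n) in counts.items() if c == condition)
    return successes / total

for subgroup in ("Subgroup A", "Subgroup B"):
    for condition in ("condition present", "condition absent"):
        print(subgroup, condition, round(subgroup_rate(counts, subgroup, condition), 2))

for condition in ("condition present", "condition absent"):
    print("Pooled", condition, round(pooled_rate(counts, condition), 2))
```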

In addition to their use in parametric statistical analyses, many of the datasets lend themselves to nonparametric analyses. A valuable exercise might be to have students analyze a dataset using both parametric and nonparametric procedures. The resulting discussion could focus on the importance of choosing the appropriate statistical analysis, as well as the impact of the violations of normality assumptions.

Large Scale Datasets

For large scale data analyses we have included the ECLS-K dataset. The Early Childhood Longitudinal Study – Kindergarten (ECLSK_sample) dataset is a subset of data from the ECLS-Kindergarten Class of 1998-99 (ECLS-K) Public Use Dataset () collected by the National Center for Education Statistics (West, ksum.pdf). The complete dataset is available for public use, and is located at the NCES website along with more detailed User’s Guide information, statistical documentation, and user resources. The complete dataset includes data on a nationally representative sample of about 21,260 children enrolled in both private and public full-day and partial-day kindergarten programs in the academic year 1998-99. The type of data includes child and parent demographic, child academic and behavioral, family environment, and classroom and school demographic variables.

The data file included on this disk is a subset of 97 academic, behavioral, demographic, and family environment variables (with 6 sample weighting variables and their associated 540 replicate weights) for a total of 643 variables. All 21,260 students are included in the dataset, thus the ECLSK_sample dataset has the same sampling properties as the original public use dataset. In the original sampling, oversampling occurred for select subgroups such as Asian students and students in private kindergarten programs (West, ksum.pdf). Thus, weighting variables are necessary for producing data that are representative of the 1998-99 national population. Additionally, the multi-stage sampling procedure used probability sampling from within primary sampling units. Because the sampling procedure allows for correlated samples, the within-group error variance is an underestimate of what would be found in the population, and consequently, test statistics computed from the samples will be inflated. There are two common ways to adjust test statistics computed from the samples: the use of Design Effects or the use of re-estimation statistical packages such as SUDAAN () or WesVar (). Design effect estimates can be found in the ECLS-K User’s Guide.
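The design effect adjustment can be illustrated with a short sketch. The Python functions below apply the usual approximations: a standard error computed under simple random sampling is multiplied by the square root of the design effect, a t statistic is divided by that square root, and an F statistic is divided by the design effect itself. The design effect of 4.0 and the unadjusted statistics are hypothetical values chosen only to make the arithmetic easy to follow; the actual estimates should be taken from the ECLS-K User’s Guide.

```python
import math

def _check(deff):
    if deff <= 0:
        raise ValueError("design effect must be positive")

def adjust_standard_error(se_srs, deff):
    """Inflate a simple-random-sampling standard error by sqrt(design effect)."""
    _check(deff)
    return se_srs * math.sqrt(deff)

def adjust_t_statistic(t_srs, deff):
    """Deflate a t statistic computed as if the sample were a simple random sample."""
    _check(deff)
    return t_srs / math.sqrt(deff)

def adjust_f_statistic(f_srs, deff):
    """Approximate adjustment for an F statistic (F behaves like a squared t)."""
    _check(deff)
    return f_srs / deff

def effective_sample_size(n, deff):
    """Size of a simple random sample that would give the same precision."""
    _check(deff)
    return n / deff

hypothetical_deff = 4.0  # illustrative value only, not taken from the User's Guide
print(adjust_standard_error(1.5, hypothetical_deff))    # unadjusted SE of 1.5
print(adjust_t_statistic(4.0, hypothetical_deff))       # unadjusted t of 4.0
print(adjust_f_statistic(16.0, hypothetical_deff))      # unadjusted F of 16.0
print(effective_sample_size(21260, hypothetical_deff))  # full ECLSK_sample n
```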

The ECLSK_sample data file is recommended for use by students in moderate to advanced applied research methods and statistics courses; it is not recommended for students in introductory courses. The format of the variables requires students to utilize recoding procedures and provides opportunities for students to practice the creation of new variables by combining multiple related background and/or environmental variables. Weighting can be introduced to the students through the use of the sampling weights provided in the data file. Additionally, students can learn about the need for design effects with samples obtained by clustered or multi-stage sampling procedures and/or the use of jackknifing procedures with selection of the replicate weights provided.
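The role of the sampling weights and replicate weights can be shown on a toy example. The Python sketch below computes a weighted mean with the full-sample weight, recomputes it once per replicate weight, and sums the squared deviations of the replicate estimates from the full-sample estimate. The three scores and the weights are invented, and the multiplier applied to the sum of squares depends on the replication scheme; it is left as a parameter (`scale`, defaulting to 1.0 as an assumption), and the ECLS-K User’s Guide should be consulted for the appropriate value.

```python
import math

def weighted_mean(values, weights):
    """Weighted mean: sum(w * y) / sum(w)."""
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(w * y for w, y in zip(weights, values)) / total_weight

def replicate_variance(values, full_weights, replicate_weights, scale=1.0):
    """Replication variance estimate for the weighted mean.

    `replicate_weights` is a list of weight vectors, one per replicate.
    `scale` is the scheme-specific multiplier (an assumption in this sketch).
    """
    full_estimate = weighted_mean(values, full_weights)
    squared_deviations = [
        (weighted_mean(values, rep) - full_estimate) ** 2
        for rep in replicate_weights
    ]
    return scale * sum(squared_deviations)

# Toy data invented for illustration: three children, one full-sample weight
# each, and two replicate weight vectors.
scores = [1.0, 2.0, 3.0]
full_w = [1.0, 1.0, 2.0]
rep_w = [
    [2.0, 0.0, 2.0],
    [0.0, 2.0, 2.0],
]

estimate = weighted_mean(scores, full_w)
variance = replicate_variance(scores, full_w, rep_w)
print("weighted mean =", estimate)
print("replicate variance =", variance)
print("standard error =", math.sqrt(variance))
```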

The types of variables allow for a variety of statistical procedures including nonparametric statistics, multiple regression, analysis of variance, analysis of covariance, and multivariate analysis of variance procedures. Professors teaching courses that include multiple regression, multivariate analysis, measurement and evaluation, and large-scale database analysis may find the data file useful for classroom examples and student practice. Additionally, professors will be able to create numerous smaller datasets from the data file for classroom use.

Included in the ECLS-K folder are the data file in two formats (a dBase file and a SAS data file; an Excel file could not be used because of the 256 variable limit), a Microsoft Word file of the variable codebook, and a SAS file listing the variable labels and format statements. The user will want to review the ECLS-K User’s Guide for more detailed information on sampling, data collection, variables, use of weights, design effects, and appropriate variance estimation procedures. The dBase (.dbf) file is recommended for use in WesVar.

Monte Carlo Simulations

If you have descriptive statistical information for a data set, but don’t actually have the data set, a very efficient method to help develop a practice or pilot research data set is through the use of Monte Carlo simulations. In Monte Carlo simulations a researcher uses the descriptive data to create “parallel” data sets that have the characteristics of the original data set. Further, the researcher can create an unlimited number of cases and conditions associated with this original data set.

Monte Carlo simulations have traditionally been used in statistics and other related fields to evaluate the effectiveness of new methods and procedures. For example, suppose a researcher develops a new statistical procedure that needs to be checked under various conditions, such as discrepant sample sizes and normal and non-normal distributions. Collecting data or using archival data sets to evaluate the effectiveness of this new procedure under these various conditions would take a protracted amount of time. Further, random sampling error in the archival data sets may also be a problem. Thus, the researcher would use Monte Carlo simulations together with the collected and archival data sets.

A Monte Carlo simulation using the Stanford Achievement Test, Version 10 (SAT-10) data is demonstrated. Descriptive information for the SAT-10 7th grade spring administration of the exam has been selected. The descriptive information needed to conduct this type of Monte Carlo simulation consists of the means, standard deviations, and correlations among all the variables (see Table 2). The variables selected for this simulation are Reading Vocabulary, Reading Comprehension, Reading Total, Math Concepts, Math Problem Solving, and Math Total.

Table 2. Descriptive Statistics for SAT-10 7th Grade Spring Exam

Columns V1 through V6 are correlations.

|Variable |Mean |Std |V1 |V2 |V3 |V4 |V5 |V6 |
|Reading: | | | | | | | | |
|Vocabulary (V1) |669.4 |39.1 |1.00 |. |. |. |. |. |
|Comprehension (V2) |680.2 |48.8 |0.91 |1.00 |. |. |. |. |
|Total (V3) |663.3 |39.1 |0.96 |0.78 |1.00 |. |. |. |
|Math: | | | | | | | | |
|Concepts (V4) |668.6 |37.9 |0.71 |0.65 |0.68 |1.00 |. |. |
|Problem Solving (V5) |666.2 |37.6 |0.69 |0.64 |0.66 |0.95 |1.00 |. |
|Total (V6) |672.2 |48.1 |0.64 |0.57 |0.62 |0.93 |0.77 |1.00 |

Using the sample program written in SAS version 9.2 (see Figure 1), you can complete a Monte Carlo simulation of the SAT-10 7th grade data provided in Table 2. A data set called SAT 10 Macro.xls with 10,000 observations, generated using the macro in Figure 1, is available on the provided CD.
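For readers without access to SAS, the general idea can be sketched in Python. The example below is not the macro in Figure 1; it is a minimal illustration that assumes the scores are multivariate normal and uses a Cholesky factor of the correlation matrix to turn independent standard normal draws into correlated scores. To keep the sketch short it uses only three of the six variables, with means, standard deviations, and correlations copied from Table 2 as printed; the seed and function names are arbitrary. The `cholesky` function raises an error if a correlation matrix is not positive definite, which can happen when rounded correlations involving total scores and their components are entered.

```python
import math
import random
import statistics

# Values copied from Table 2 for three of the six variables.
NAMES = ["Reading Vocabulary (V1)", "Reading Comprehension (V2)", "Math Concepts (V4)"]
MEANS = [669.4, 680.2, 668.6]
SDS = [39.1, 48.8, 37.9]
CORR = [
    [1.00, 0.91, 0.71],
    [0.91, 1.00, 0.65],
    [0.71, 0.65, 1.00],
]

def cholesky(matrix):
    """Lower-triangular L such that L times its transpose equals `matrix`."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                pivot = matrix[i][i] - partial
                if pivot <= 0:
                    raise ValueError("matrix is not positive definite")
                lower[i][j] = math.sqrt(pivot)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower

def simulate(n_cases, means, sds, corr, rng):
    """Draw n_cases rows of correlated normal scores."""
    lower = cholesky(corr)
    k = len(means)
    cases = []
    for _ in range(n_cases):
        z = [rng.gauss(0.0, 1.0) for _ in range(k)]  # independent N(0, 1) draws
        correlated = [sum(lower[i][j] * z[j] for j in range(i + 1)) for i in range(k)]
        cases.append([means[i] + sds[i] * correlated[i] for i in range(k)])
    return cases

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(10)  # arbitrary seed, for reproducibility
data = simulate(10_000, MEANS, SDS, CORR, rng)
columns = list(zip(*data))

for name, column in zip(NAMES, columns):
    print(name, round(statistics.fmean(column), 1), round(statistics.stdev(column), 1))
print("r(V1, V2) =", round(pearson(columns[0], columns[1]), 2))
print("r(V1, V4) =", round(pearson(columns[0], columns[2]), 2))
print("r(V2, V4) =", round(pearson(columns[1], columns[2]), 2))
```

With 10,000 simulated cases, the printed means, standard deviations, and correlations should fall close to the corresponding entries in Table 2.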

This type of simulation process can also be extremely valuable for use in classroom environments. The last few lines of SAS code include a procedure called “Proc Surveyselect.” This procedure can be used to select random subsets of the data from the file SAT 10 Macro.xls. For this example, we have selected a sample of 200, with the data output to a file called “temp1.” This file, listed on the CD as Temp 1.xls, contains the 200 observations randomly selected from SAT 10 Macro.xls. To confirm the macro is working effectively, the descriptive statistics for “temp1” are provided in Table 3. A comparison of the descriptive statistics in Table 2 with those in Table 3 shows close agreement, which supports the conclusion that “temp1” is a representative sample of the SAT-10 7th Grade achievement data.
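The subsampling step can be sketched in the same spirit. The Python example below assumes simple random sampling without replacement and, for brevity, works with a single simulated variable (normal scores with the Reading Vocabulary mean and standard deviation from Table 2) rather than the full SAT 10 Macro.xls file; the seed is arbitrary. It draws 200 of 10,000 cases and compares the sample statistics with those of the full file.

```python
import random
import statistics

rng = random.Random(200)  # arbitrary seed, for reproducibility

# Stand-in for one column of the 10,000-case simulated file.
population = [rng.gauss(669.4, 39.1) for _ in range(10_000)]

# Simple random sample of 200 cases, drawn without replacement.
sample = rng.sample(population, 200)

print("full file: mean =", round(statistics.fmean(population), 1),
      "sd =", round(statistics.stdev(population), 1))
print("sample:    mean =", round(statistics.fmean(sample), 1),
      "sd =", round(statistics.stdev(sample), 1))
```

Rerunning the sketch with different seeds gives each student a different sample of 200 from the same simulated file.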

Using Monte Carlo simulation procedures you can develop individualized data sets for students, complete pilot research work, or examine results for previous studies under the different conditions you place on the analyses.

Table 3. Descriptive Statistics for Monte Carlo Sample of 200 for SAT-10 7th Grade Spring Exam

Columns V1 through V6 are correlations.

|Variable |Mean |Std |V1 |V2 |V3 |V4 |V5 |V6 |
|Reading: | | | | | | | | |
|Total (V1) |668.5 |39.4 |1.00 |. |. |. |. |. |
|Vocabulary (V2) |680.6 |49.3 |0.91 |1.00 |. |. |. |. |
|Comprehension (V3) |663.3 |39.4 |0.96 |0.78 |1.00 |. |. |. |
|Math: | | | | | | | | |
|Total (V4) |668.3 |37.9 |0.72 |0.65 |0.69 |1.00 |. |. |
|Concepts (V5) |666.0 |37.6 |0.70 |0.64 |0.66 |0.95 |1.00 |. |
|Problem Solving (V6) |671.8 |48.2 |0.64 |0.57 |0.62 |0.93 |0.77 |1.00 |

Sample printout from SAS

Examples of some of the SAS printout for selected analyses are included in Appendix A. They include a univariate analysis, SAS graph, correlation, and an ANOVA. These demonstrate how a standard statistical program will generate examples for discussion in class.

Conclusion and Distribution

The paper discussed the contents and structure of the CD datasets as well as suggestions for how some of the datasets can be utilized. The CD is free and you may use it in your teaching. Again, proper credit must be given to the appropriate source. For instance, at StatLib they use the statement: “If you use an algorithm, dataset, or other information from StatLib, please acknowledge both StatLib and the original contributor of the material.” For the NCES datasets they prefer the following citation: National Center for Education Statistics, U.S. Department of Education.

We hope these datasets will be helpful as you prepare your courses. We will continue to add additional datasets to the CD and will make them available to interested professionals. You may contact one of the authors at the University of Arkansas.

Table 1. Datasets Contained on the CD

|Title of Data Set |Name on CD |n |Variables in Data Set |Source |
|1993 New Car Data |1993 Cars |93 |Manufacturer, Model, Type, Minimum price, Midrange price, Maximum price, City MPG, Highway MPG, Air bags standard, Drive train type, Number of cylinders, Engine size, Horsepower, RPM, Engine revolutions per mile, Manual transmission available, Fuel tank capacity, Passenger capacity, Length, Wheelbase, Width, U-turn space, Rear seat room, Luggage capacity, Weight, Domestic manufacturing |Consumer Reports: The 1993 Cars-Annual Auto Issue (April), Yonkers: Consumers Union. PACE New Car & Truck 1993 Buying Guide. Milwaukee: Pace Publications. Quoted in Lock, R. H. (1993). 1993 New Car Data. Journal of Statistics Education, 1(1). |
|1994 AAUP Faculty Salary Data |AAUP |1161 |Federal ID number, College Name, State, Type, Avg. salary - full professors, Avg. salary - associate professors, Avg. salary - assistant professors, Avg. salary - all ranks, Avg. compensation - full professors, Avg. compensation - associate professors, Avg. compensation - assistant professors, Avg. compensation - all ranks, Number of full professors, Number of associate professors, Number of assistant professors, Number of Instructors, Number of faculty - all ranks |March-April 1994 issue of Academe. Submitted to the Journal of Statistics Education by Robin Lock. |
|2004 New Car and Truck Data |2004 Cars |428 |Vehicle name, Sports car, SUV, Wagon, Minivan, Pickup, All-wheel drive, Rear-wheel drive, Suggested retail price, Dealer price, Engine size, Number of cylinders, Horsepower, City MPG, Highway MPG, Weight, Wheel base, Length, Width |Kiplinger's Personal Finance, December 2003, vol. 57, no. 12, pp. 104-123, http:/. Submitted to the Journal of Statistics Education by Roger W. Johnson |
|A Dataset That Is 44% Outliers |Outlier |43 |President name, Number of days in office |2001 World Almanac. Quoted in Hayden, R. W. (2005). A dataset that is 44% outliers. Journal of Statistics Education, 13(1). |
|Abortion Opinion Data |Abortion Opinion |2385 |Race, Gender, Age, Opinion |Christensen, R. (1990). Log-linear models. New York: Springer-Verlag. |
|Absentee and Machine Ballot Votes in Philadelphia Elections |Philadelphia Voting |22 |Year of election, District number, Democrat absentee vote in district, Republican absentee vote in district, Democrat machine vote in district, Republican machine vote in district |Orley Ashenfelter. Quoted in Chatterjee, S., Handcock, M. S., & Simonoff, J. S. (1995). A casebook for a first course in statistics and data analysis. New York: John Wiley. |
|Advertising Pages and Advertising Revenue in 1986 |Advertising Pages |41 |Name of publication, Number of advertising pages in hundreds, Advertising revenue in millions of dollars |Chatterjee, S., & Price, B. (1991). Regression analysis by example (2nd ed.). New York: John Wiley. |
|Annual Data on Advertising, Promotions, Sales Expenses, and Sales |Advertising |22 |Advertising expenditures, Promotion expenditures, Sales expense, Sales, Previous year's advertising expenditures, Previous year's promotion expenditures |Chatterjee, S., & Price, B. (1991). Regression analysis by example (2nd ed.). New York: John Wiley. |
|Annual Return Rates in the Stock Market, 1976-1993 |Stock Market |18 |Year, Standard and Poor’s Index year end value, Vanguard Index Trust 500 Portfolio year end value |Vanguard Market Index Trust 500--Portfolio Annual Report, 1993 (p. 7). Quoted in Chatterjee, S., Handcock, M. S., & Simonoff, J. S. (1995). A casebook for a first course in statistics and data analysis. New York: John Wiley. |
|Attitude Survey Data |Employee Satisfaction |30 |Overall rating of job being done by supervisor, Handles employee complaints, Does not allow special privileges, Opportunity to learn new things, Raises based on performances, Too critical of poor performances, Rate of advancing to better jobs |Chatterjee, S., & Price, B. (1991). Regression analysis by example (2nd ed.). New York: John Wiley. |
|Average Monthly Air Temperature in Recife, Brazil, 1953-1962 |Average Temperature |120 |Month, Year, Average air temperature for a given month |Recife.TS |
|Ball Bearing Reliability Data |Ball Bearings |210 |Company code, Test number, Year of test, Number of bearings, Load, Number of balls, Diameter, L10, L50, Weibull slope, Bearing type |Lieblein and Zelen (1956). Statistical investigation of the fatigue life of deep-groove ball bearings. Quoted in Caroni (2002). Modeling the reliability of ball bearings. Journal of Statistics Education, 10(3). |
|Baseline Data for Mayo Clinic Trial in Primary Biliary Cirrhosis (PBC) of the Liver |Baseline Cirrhosis |418 |ID; Number of days between registration and the earlier of death, transplantation, or study analysis time in July, 1986; Death status; Drugs administered; Age; Sex; Presence of ascites; Presence of hepatomegaly; Presence of spiders; Presence of edema; Serum bilirubin; Serum cholesterol; Albumin; Urine copper; Alkaline phosphatase; SGOT; Triglycerides; Platelets; Prothrombin time; Histologic stage of disease |Fleming, T. R., & Harrington, D. P. (1991). Counting processes and survival analysis. New York: Wiley. |
|Betting on Professional Football Results for 1989-1991 |NFL |672 |Name of favored team, Name of underdog team, Betting result, Day and time of game, Favored team at home or away, Week of season, Year |Compiled by Hal Stern. Submitted to the Statlib facility by Robin Lock. Quoted in Chatterjee, S., Handcock, M. S., & Simonoff, J. S. (1995). A casebook for a first course in statistics and data analysis. New York: John Wiley. |
|Birth to Ten Study: An Example of Simpson's Paradox |Birth to Ten A (Note: This data set contains the same information as Birth to Ten B in a different format.) |1590 |Medical aid given to mother, Mother traced for 5 year interview, Race, Frequency |Chronic Diseases of Lifestyle Programme at the Medical Research Council in Cape Town, South Africa. Quoted in Morrell, C. H. (1999). Simpson's paradox: An example from a longitudinal study in South Africa. Journal of Statistics Education, 7(3). |
|Birth to Ten Study: An Example of Simpson's Paradox |Birth to Ten B (Note: This data set contains the same information as Birth to Ten A in a different format.) |1590 |Medical aid given to mother, Mother traced for 5 year interview, Race |Chronic Diseases of Lifestyle Programme at the Medical Research Council in Cape Town, South Africa. Quoted in Morrell, C. H. (1999). Simpson's paradox: An example from a longitudinal study in South Africa. Journal of Statistics Education, 7(3). |
|Building Characteristics and Sales Price |Property Valuation |24 |Taxes, Number of bathrooms, Lot size, Living space, Number of garage stalls, Number of rooms, Number of bedrooms, Age of the home, Number of fireplaces, Sale price |Narula, S. C., & Wellington, J. F. (1977). Technometrics, 19 (2). Quoted in Chatterjee, S., & Price, B. (1991). Regression analysis by example (2nd ed.). New York: John Wiley. |
|Calcium, Inorganic Phosphorus and Alkaline Phosphatase Levels in Elderly Patients |Calcium (Note: This dataset intentionally has errors so that students may practice cleaning data. The cleaned dataset is Calciumgood.) |178 |Patient observation number, Age in years, Sex, Alkaline phosphatase international units/liter, Lab name, Calcium mmol/L, Inorganic phosphorus mmol/L, Age group |Boyd, J., Delost, M., and Holcomb, J. (1998). Calcium, phosphorus, and alkaline phosphatase laboratory values of elderly subjects. Clinical Laboratory Science, 11. Quoted in Holcomb, J., and Spalsbury, A. (2005). Journal of Statistics Education, 13(3). |
|Calcium, Inorganic Phosphorus and Alkaline Phosphatase Levels in Elderly Patients--Cleaned Dataset |Calciumgood |178 |Patient observation number, Age in years, Sex, Alkaline phosphatase international units/liter, Lab name, Calcium mmol/L, Inorganic phosphorus mmol/L, Age group |Boyd, J., Delost, M., and Holcomb, J. (1998). Calcium, phosphorus, and alkaline phosphatase laboratory values of elderly subjects. Clinical Laboratory Science, 11. Quoted in Holcomb, J., and Spalsbury, A. (2005). Journal of Statistics Education, 13(3). |
|Cigarette Consumption Data by State, 1970 |Cigarette Consumption |51 |State; Median age; Percentage of people over 25 years of age who had completed high school; Per capita personal income; Percentage of blacks; Percentage of females; Weighted average price of a pack of cigarettes; Number of packs of cigarettes sold on a per capita basis |Chatterjee, S., & Price, B. (1991). Regression analysis by example (2nd ed.). New York: John Wiley. |
|Cloud Seeding Data |Cloud Seeding |24 |Action, Day number, Seeding suitability, Echo coverage, Prewetness, Echo motion, Amount of rain |Woodley, W. L., Simpson, J., Biondini, R., & Berkeley, J. (1977). Rainfall results 1970-75: Florida area cumulus experiment. Science, 195, 735-42. Quoted in Cook, R. D., & Weisberg, S. (1982). Residuals and influence in regression. New York: Chapman and Hall. |
|Cloud-seeding Experiment in Tasmania Between Mid-1964 and January 1971 |Rainfall |108 |Period, Seeding status, Season, East target area rainfall, West target area rainfall, North control area rainfall, South control area rainfall, Northwest control area rainfall |Miller, A. J., Shaw, D. E., Veitch, L. G. & Smith, E. J. (1979). Analyzing the results of a cloud-seeding experiment in Tasmania. Communications in Statistics - Theory & Methods, vol. A8(10), 1017-1047. |
|Comparison of Changes in Exchange Rates and Differences in Inflation Rates for Various Countries |Exchange Rates |44 |Country name, Change in exchange rate 1975-1990, Change in exchange rate 1985-1990, Change in inflation rates 1975-1990, Change in inflation rates 1985-1990 |International Financial Statistics Yearbook. Quoted in Chatterjee, S., Handcock, M. S., & Simonoff, J. S. (1995). A casebook for a first course in statistics and data analysis. New York: John Wiley. |
|Comparison of Health Care Spending Across the United States |Health Care Spending |50 |State, Census Bureau region of the state, Census Bureau region number, Per capita health spending, Percent of per capita income spent on health |The New York Times. October 15, 1993. Quoted in Chatterjee, S., Handcock, M. S., & Simonoff, J. S. (1995). A casebook for a first course in statistics and data analysis. New York: John Wiley. |
|Comparison of Productivity and Quality in Japanese and Non-Japanese Automobile Manufacturing |Japanese Autos |27 |Assembly defects per 100 cars, Hours per vehicle, National origin of facility, Assembly defects per 100 cars (non-Japanese origin), Assembly defects per 100 cars (Japanese origin), Hours per vehicle (non-Japanese origin), Hours per vehicle (Japanese origin) |Womack, J. P., Jones, D. T., & Roos, D. (1990). The machine that changed the world. New York: Rawson. Quoted in Chatterjee, S., Handcock, M. S., & Simonoff, J. S. (1995). A casebook for a first course in statistics and data analysis. New York: John Wiley. |

|Consumer Expenditure and Money |Consumer Expenditure |20 |Quarter, Consumer expenditure, money stock |Friedman, M., & Meiselman, D. (1963). Commission on money |

|Stock 1952-1956 | | | |and credit, stabilization policies. Englewood Cliffs, NJ: |

| | | | |Prentice Hall. Quoted in Chatterjee, S., & Price, B. (1991).|

| | | | |Regression analysis by example (2nd ed.). New York: John |

| | | | |Wiley. |

|County Data from the 2000 |Florida Voting 2000 |67 |County, Type of voting machine used, Column format of ballot, | |

|Presidential Election in Florida | | |Undervote count, Overvote count, Votes counted for Bush, Gore, | |

|(Excluding Federal Absentee Votes)| | |Browne, Nader, Harris, Hagelin, Buchanan, McReynolds, Phillips, | |

| | | |Moorehead, Chote, McCarthy | |

|Data on French Economy; IMPORT |French Economy |18 |Year, Imports, Domestic production, Stock formation, Domestic |Malinvaud, E. (1968). Statistical methods in econometrics. |

|Data (Billions of French Francs) | | |consumption |Chicago: Rand McNally. |

| | | | |Quoted in Chatterjee, S., & Price, B. (1991). Regression |

| | | | |analysis by example (2nd ed.). New York: John Wiley. |

|Diameter, Height, and Volume of |Cherry Trees |31 |Diameter, Height, Volume |Ryan, T., Joiner, B., & Ryan, B. (1976). Minitab student |

|Black Cherry Trees in Allegheny | | | |handbook. North Scituate, MA: Duxbury Press. Quoted in Cook,|

|National Forest, Pennsylvania | | | |R. D., & Weisberg, S. (1982). Residuals and influence in |

| | | | |regression. New York: Chapman and Hall. |

|Diamond Pricing with Dummy |Diamond Pricing with Dummy Variables |308 |Carat, Indicator for color D, Indicator for color E, Indicator for|Chu, S. (2001). Pricing the C's of diamond stones. Journal |

|Variables | | |color F, Indicator for color G, Indicator for color H, Indicator |of Statistics Education, 9(2). |

| | | |for clarity IF, Indicator for clarity VVS1, Indicator for clarity | |

| | | |VVS2, Indicator for clarity VS1, Indicator for certification body | |

| | | |GIA, Indicator for certification body IGI, Indicator for medium | |

| | | |stones, Indicator for large stones, Interaction variable | |

| | | |med*carat, Interaction variable large*carat, Carat squared, Price | |

| | | |in Singapore dollars, Ln(Price) | |

|Disposable Income and Ski Sales |Ski Sales 1 |40 |Quarter, Ski sales, Personal disposable income |Chatterjee, S., & Price, B. (1991). Regression analysis by |

|for Years 1964-1974 | | | |example (2nd ed.). |

| | | | |New York: John Wiley. |

|Disposable Income, Ski Sales, and |Ski Sales 2 |40 |Quarter, Ski sales, Personal disposable income, Season |Chatterjee, S., & Price, B. (1991). Regression analysis by |

|Seasonal Variables for Years | | | |example (2nd ed.). |

|1964-1974 | | | |New York: John Wiley. |

|Distribution for Males and Females|Swedish Birth Dates |12 |Month, Number of females born, Number of males born |Cramer, H. (1946). Mathematical methods of statistics. |

|Born in Sweden in 1935 | | | |Princeton: Princeton University Press. Quoted in |

| | | | |Christensen, R. (1990). Log-linear models. New York: |

| | | | |Springer-Verlag. |

|Distribution of White Student |White Enrollment |56 |District, Proposed legislative district, Total public school |Newsday, May 20, 1994. Quoted in Chatterjee, S., Handcock, |

|Enrollment in Nassau County School| | |enrollment, White student enrollment |M. S., & Simonoff, J. S. (1995). A casebook for a first |

|Districts | | | |course in statistics and data analysis. New York: John |

| | | | |Wiley. |

|Dow Jones Industrial Average and |Dow Jones |161 |Date, Dow Jones Industrial Average at the close of the day, |Chatterjee, S., Handcock, M. S., & Simonoff, J. S. (1995). A|

|the S & P 500 Index Values Weekly | | |Standard and Poor’s 500 Stock Index at the close of the day |casebook for a first course in statistics and data analysis.|

|From February 1, 1991 to February | | | |New York: John Wiley. |

|25, 1994 | | | | |

|Drill Bit Performance Over a Range|Drill Bit Data |31 |Speed of rotation, Feed rate, Diameter of drill bit, Axial load on|M. R. Delozier of Kennametal, Inc., Latrobe, Pennsylvania. |

|of Drilling Conditions | | |drill bit |Quoted in Cook, R. D., & Weisberg, S. (1982). Residuals and |

| | | | |influence in regression. New York: Chapman and Hall. |

|Drug Dosage Retained in Rat Livers|Rat Data |19 |Body weight, Liver weight, Relative dose, Percentage of dose |Weisberg, S. (1980). Applied Linear Regression. New York: |

| | | |retained in liver |Wiley. Quoted in Cook, R. D., & Weisberg, S. (1982). |

| | | | |Residuals and influence in regression. New York: Chapman and|

| | | | |Hall. |

|Early Childhood Longitudinal Study|ECLSK_sample.sas7bdat |21260 |See ECLSK_sample codebook.doc (643 variables available) |National Center for Education Statistics, U.S. Department of|

|(ECLS-K) Data | | | |Education; accessed at |

|Effectiveness of Blast Furnace |Agricultural Data |7 |Treatment, Soil type, Corn yield |Carter, O. R., Collier, B. L., & Davis, F. L. (1951). Blast |

|Slags as Agricultural Liming | | | |furnace slags as agricultural liming materials. Agronomy |

|Materials on Three Soil Types | | | |Journal, 43, 430-433. Quoted in Cook, R. D., & Weisberg, S. |

| | | | |(1982). Residuals and influence in regression. New York: |

| | | | |Chapman and Hall. |

|Emergency Calls to the New York |Auto Calls |28 |Date, Emergency road service calls answered, Forecast high |New York Motorist. (March 1994). Automobile Club of New |

|Auto Club in January 1993 and | | |temperature, Forecast low temperature, Daily high temperature, |York. Quoted in Chatterjee, S., Handcock, M. S., & Simonoff,|

|January 1994 | | |Daily low temperature, Rain forecast, Snow forecast, Type of day, |J. S. (1995). A casebook for a first course in statistics |

| | | |Year, Sunday, Subzero temperature |and data analysis. New York: John Wiley. |

|Equal Educational Opportunity |EEO Data |70 |Family, Peer, School Achievement |Chatterjee, S., & Price, B. (1991). Regression analysis by |

|(EEO) Data; Standardized Indexes | | | |example (2nd ed.). New York: John Wiley. |

|Eruption Durations and |Old Faithful |222 |Date, Duration of eruption, Time until next eruption |Weisberg, S. (1985). Applied linear regression (2nd ed.). |

|Intereruption Times for the "Old | | | |New York: John Wiley. Quoted in Chatterjee, S., Handcock, M.|

|Faithful" Geyser in Yellowstone | | | |S., & Simonoff, J. S. (1995). A casebook for a first course |

|National Park | | | |in statistics and data analysis. New York: John Wiley. |

|Excretion of Steroids in Patients |Cushing’s Syndrome |21 |Type of Cushing’s syndrome, Levels of tetrahydrocortisone, Levels |Aitchison, J., & Dunsmore, I. R. (1975). Statistical |

|with Cushing's Syndrome | | |of pregnanetriol |prediction analysis. Cambridge: Cambridge University Press.|

| | | | |Quoted in Christensen, R. (1990). Log-linear models. New |

| | | | |York: Springer-Verlag. |

|Financial Ratios of Solvent and |Financial Ratios |66 |(working capital)/(total assets), (retained earnings)/(total |Chatterjee, S., & Price, B. (1991). Regression analysis by |

|Bankrupt Firms | | |assets), (earnings before interest and taxes)/(total assets), |example (2nd ed.). New York: John Wiley. |

| | | |(market-value equity)/(book value of total liabilities), | |

| | | |sales/(total assets), bankruptcy status | |

|Forced Expiratory Volume of |FEV |654 |Age, Forced Expiratory Volume (FEV), Height, Sex, Smoking status |Rosner, B. (1999). Fundamentals of biostatistics (5th ed.). |

|Smokers and Non-smokers | | | |Pacific Grove, CA: Duxbury. Quoted in Kahn, M. (2005). An |

| | | | |exhalent problem for teaching statistics. Journal of |

| | | | |Statistics Education, 13(2). |

|Fuel Consumption and Automotive |Fuel Consumption |30 |Miles/gallon, Displacement, Horsepower, Torque, Compression ratio,|Motor Trend magazine, 1975. Quoted in Chatterjee, S., & |

|Variables | | |Rear axle ratio, Carburetor (barrels), Number of transmission |Price, B. (1991). Regression analysis by example (2nd ed.). |

| | | |speeds, Overall length, Width, Weight, Type of transmission |New York: John Wiley. |

|Gesell Adaptive Score and Age at |First Word |21 |Age at first word, Gesell adaptive score |Mickey, M. R., Dunn, O. J., & Clark, V. (1967). Note on the |

|First Word | | | |use of stepwise regression in detecting outliers. Computers |

| | | | |& Biomedical Research, 1, 105-9. Quoted in Cook, R. D., & |

| | | | |Weisberg, S. (1982). Residuals and influence in regression. |

| | | | |New York: Chapman and Hall. |

|Graduate Admissions at Berkeley |Berkeley Graduate Admissions |4526 |Department, Gender, Admission status |Bickel, P. J., Hammel, E. A., & O'Connell, J. W. (1975). Sex |

| | | | |bias in graduate admissions: Data from Berkeley. Science, |

| | | | |187, 398-404. Quoted in Christensen, R. (1990). Log-linear |

| | | | |models. New York: Springer-Verlag. |

|Jet Fighter Data |Jet Fighter |22 |Aircraft ID, First flight date, Specific power, Flight range |Stanley, W., & Miller, M. (1979). Measuring technological |

| | | |factor, Payload, Sustained load factor, Carrier capability |change in jet fighter aircraft. Report No. R-2249-AF. Santa|

| | | | |Monica: Rand Corp. Quoted in Cook, R. D., & Weisberg, S. |

| | | | |(1982). Residuals and influence in regression. New York: |

| | | | |Chapman and Hall. |

|Lead Rating and News Rating of |Television Ratings |30 |Lead rating, News rating |Chatterjee, S., & Price, B. (1991). Regression analysis by |

|Television Data | | | |example (2nd ed.). |

| | | | |New York: John Wiley. |

|Length of Computer Service Calls |Service Calls 1 |14 |Units, Minutes |Chatterjee, S., & Price, B. (1991). Regression analysis by |

|and Number of Units Repaired | | | |example (2nd ed.). |

| | | | |New York: John Wiley. |

|Length of Computer Service Calls |Service Calls 2 |24 |Units, Minutes |Chatterjee, S., & Price, B. (1991). Regression analysis by |

|and Number of Units | | | |example (2nd ed.). |

|Repaired--Expanded Sample | | | |New York: John Wiley. |

|Length of Visits to on |msnbclength |50,000 |Length of visit |Internet Information Server logs for and |

|September 28, 1999 | | | |news-related portions of . Quoted in Sanchez, J., & |

| | | | |He, Y. (2005). Internet data analysis for the undergraduate |

| | | | |statistics curriculum. Journal of Statistics Education, |

| | | | |13(3). |

|Leukemia Data for Patients |Leukemia Data AG Positive |17 |White blood cell count, Survival time |Feigl, P., & Zelen, M. (1965). Estimation of exponential |

|Diagnosed as AG Positive | | | |probabilities with concomitant information. Biometrics, 21, |

| | | | |826-838. Quoted in Cook, R. D., & Weisberg, S. (1982). |

| | | | |Residuals and influence in regression. New York: Chapman and|

| | | | |Hall. |

|Leukemia Data for Patients |Leukemia Data |30 |White blood cell count, AG status, Number of patients surviving at|Feigl, P., & Zelen, M. (1965). Estimation of exponential |

|Diagnosed as AG Positive or AG | | |least 52 weeks, Number of patients in each combination of WBC and |survival probabilities with concomitant information. Biometrics, 21,|

|Negative | | |AG |826-838. Quoted in Cook, R. D., & Weisberg, S. (1982). |

| | | | |Residuals and influence in regression. New York: Chapman and|

| | | | |Hall. |

|Los Angeles Heart Study Data |Chapman Data |200 |Age, Systolic blood pressure, Diastolic blood pressure, |Dixon, W. J., & Massey, F. J., Jr. (1983). Introduction to |

| | | |Cholesterol, Height, Weight, Coronary incident |statistical analysis. New York: McGraw-Hill. Quoted in |

| | | | |Christensen, R. (1990). Log-linear models. New York: |

| | | | |Springer-Verlag. |

|Lug Counts from Vineyard Harvest |Lug Counts |52 |Row number, Number of lugs for 1983, Number of lugs for 1984, |Barnhill family archives, 1976-1991. Quoted in Chatterjee, |

|by Row and Year of Harvest | | |Number of lugs for 1985, Number of lugs for 1986, Number of lugs |S., Handcock, M. S., & Simonoff, J. S. (1995). A casebook |

| | | |for 1987, Number of lugs for 1988, Number of lugs for 1989, Number|for a first course in statistics and data analysis. New |

| | | |of lugs for 1990, Number of lugs for 1991 |York: John Wiley. |

|Major League Baseball Hall of Fame|MLBHOF |1340 |Player name, Number of seasons played, Games played, Official |The Baseball Encyclopedia and Total Baseball. Quoted in |

| | | |at-bats, Runs scored, Hits, Doubles, Triples, Home runs, Runs |Cochran, J. (2000). Career records for all modern position |

| | | |batted in, Walks, Strikeouts, Career batting average, On base |players eligible for the Major League Baseball Hall of Fame.|

| | | |percentage, Slugging |Journal of Statistics Education, 8(2). |

| | | |percentage, Adjusted production, Batting runs, Adjusted batting | |

| | | |runs, Runs created, Stolen bases, Times caught stealing, Stolen | |

| | | |base runs, Fielding average, Fielding runs, Primary position | |

| | | |played, Total player rating, Hall of Fame Status | |

|Mayo Clinic Trial in Primary |Cirrhosis |312 (data |ID; Number of days between registration and the earlier of death, |Fleming, T. R., & Harrington, D. P. (1991). Counting |

|Biliary Cirrhosis (PBC) of the | |given for |transplantation, or study analysis time in July, 1986; Death status;|processes and survival analysis. New York: Wiley. |

|Liver, 1974-1984 | |1945 visits) |Drugs administered; Age; Sex; Number of days between enrollment | |

| | | |and this visit date; Presence of ascites; Presence of | |

| | | |hepatomegaly; Presence of spiders; Presence of edema; Serum | |

| | | |bilirubin; Serum cholesterol; Albumin; Alkaline phosphatase; SGOT;| |

| | | |Platelets; Prothrombin time; Histologic stage of disease | |

|Monte Carlo Simulation |Sample Monte Carlo Simulation |10,000 |7th Grade SAT-10: Reading vocabulary, Reading comprehension, |Simulated data based on SAT-10 means, standard deviations, |

| |Program.doc | |Reading total, Math concepts, Math problem solving, and Math total|and correlations |

|Monthly Domestic Electricity |Electricity |55 |Month of observation, Year of observation, Average daily usage, |Handcock family archives, August 1989-February 1994. Quoted|

|Consumption at Different | | |Average daily temperature |in Chatterjee, S., Handcock, M. S., & Simonoff, J. S. |

|Temperatures | | | |(1995). A casebook for a first course in statistics and |

| | | | |data analysis. New York: John Wiley. |

|Monthly Sunspot Numbers from 1740 |Sunspots |2820 |Year, Number of sunspots per month (January-December) | |

|to 1983 | | | | |

|Number of Deaths by Horsekicks in |Horsekick Deaths |20 |Year, Corp1-Corp14, Total |Andrews, D. F., & Herzberg, A. M. (1985). Data. |

|the Prussian Army from 1875-1894 | | | |New York: Springer-Verlag. Accessed at StatLib, |

|for 14 Corps | | | | |

|Number of Supervised Workers and |Number of Supervised Workers |27 |Number of supervised workers, Number of supervisors |Chatterjee, S., & Price, B. (1991). Regression analysis by |

|Supervisors in 27 Industrial | | | |example (2nd ed.). |

|Establishments | | | |New York: John Wiley. |

|Number of Surviving Bacteria |Bacteria Death Rates |15 |Interval, Number of bacteria |Chatterjee, S., & Price, B. (1991). Regression analysis by |

|Following Exposure to 200-Kilovolt| | | |example (2nd ed.). New York: John Wiley. |

|X-rays at 6-minute Intervals | | | | |

|Numbers of Reported Sexual |Sexual Partners |3533 |Male, Female |The general social survey, 1989-1991. Quoted in Chatterjee,|

|Partners of a Sample of Males and | | | |S., Handcock, M. S., & Simonoff, J. S. (1995). A casebook |

|Females | | | |for a first course in statistics and data analysis. New |

| | | | |York: John Wiley. |

|Occupations of Family Heads for |Religion and Occupation |3966 |Religious affiliation, Occupation, Number for each category |Lazerwitz, B. (1961). A comparison of major United States |

|Families of Various Religious | | | |religious groups. Journal of the American Statistical |

|Groups | | | |Association, 56, 568-579. Quoted in Christensen, R. (1990). |

| | | | |Log-linear models. New York: Springer-Verlag. |

|Perceptions of the New York City |New York Subway |62 |Usage of subway, Cleanliness of stations, Cleanliness of trains, |Survey conducted at the Leonard N. Stern School of Business,|

|Subway System | | |Safety in station, Safety on trains, Rush hour crowding in |Spring 1994. Quoted in Chatterjee, S., Handcock, M. S., & |

| | | |stations, Rush hour crowding on trains, In-station information, |Simonoff, J. S. (1995). A casebook for a first course in |

| | | |On-train announcements, Convenience of train stops, Convenience of|statistics and data analysis. New York: John Wiley. |

| | | |train schedule, Speed of travel, Frequency of trains, Ease of | |

| | | |token purchase, Ease of token collection, Police presence in | |

| | | |stations, Police presence on trains, Availability of maps, Number | |

| | | |of uses per week | |

|Performance of National Basketball|NBA |105 |Player’s name, Player’s height, Number of games appeared in, Total|Cohn, J. (1994). The pro basketball bible. San Diego: |

|Association Guards | | |minutes played, Player’s age, Points scored per game, Assists per |Basketball Books Ltd. Quoted in Chatterjee, S., Handcock, M.|

| | | |game, Rebounds per game, Percent of field goals made, Percent of |S., & Simonoff, J. S. (1995). A casebook for a first course |

| | | |free throws made |in statistics and data analysis. New York: John Wiley. |

|Presidential Election Data, |Election |19 |Year, Democratic share of the two-party vote, Party of incumbent, |Fair, R. C. (1988). The effect of economic events on votes |

|1916-1988 | | |Party of incumbent running for election, Growth rate of real per |for president: 1984 update. Political Behavior, 10, |

| | | |capita GNP in the second and third quarters of the election year, |168-178. Quoted in Chatterjee, S., & Price, B. (1991). |

| | | |Absolute value of the rate of inflation in the 2-year period prior|Regression analysis by example (2nd ed.). New York: John |

| | | |to the election |Wiley. |

|Pricing the C’s of Diamond Stones |Diamond Pricing |308 |Carat, Color, Clarity, Certification body, Price in Singapore |Singapore's Business Times, February 18, 2000. Quoted in |

| | | |dollars |Chu, S. (2001). Pricing the C's of diamond stones. Journal |

| | | | |of Statistics Education, 9(2). |

|Relationship Between Instructor's |Intelligence Clothing Standard |1725 |Intelligence rating, Clothing rating, School standard, Number for |Gilby, W. H. (1911). On the significance of the teacher's |

|Evaluation of General | | |each category (Dataset includes three partitioning tables) |appreciation of general intelligence. Biometrika, VII, |

|Intelligence, Quality of Clothing,| | | |79-93. Quoted in Christensen, R. (1990). Log-linear models. |

|and School Standard | | | |New York: Springer-Verlag. |

|Relationship Between STAR Reading |STAR |150 |Gender, STAR reading scaled score, STAR math scaled score, SAT-9 |Randomly generated data |

|and Math and SAT-9 Reading, Math, | | |reading scaled score, SAT-9 math scaled score, SAT-9 language | |

|and Language | | |scaled score | |

|Salary Survey Data of Computer |Salary of Computer Pros |46 |Education, Experience, Management responsibility, Salary |Chatterjee, S., & Price, B. (1991). Regression analysis by |

|Professionals in a Large | | | |example (2nd ed.). |

|Corporation | | | |New York: John Wiley. |

|Sample of 200 Observations from |Temp 1 |200 |Reading total score, Reading vocabulary score, Reading |National Office for Research on Measurement and Evaluation |

|SAT-10 Monte Carlo Simulation | | |comprehension score, Math total score, Math concepts score, Math |Systems (NORMES), University of Arkansas |

| | | |problem solving score | |

|SAT-10 Monte Carlo Simulation Data|SAT 10 Macro |10,000 |Reading total score, Reading vocabulary score, Reading |National Office for Research on Measurement and Evaluation |

| | | |comprehension score, Math total score, Math concepts score, Math |Systems (NORMES), University of Arkansas |

| | | |problem solving score | |

|Scores for Students Expected to |80% Mastery |50 |ID, Score |Randomly generated data based on the binomial distribution; |

|Reach 80% Mastery Criterion on a | | | |corresponding data set found in Random Guessing.xls |

|45 Item Test with 5 Options Per | | | | |

|Item | | | | |

|Scores for Students with Random |Random Guessing |50 |ID, Score |Randomly generated data based on the binomial distribution; |

|Guessing on a 45 Item Test with 5 | | | |corresponding data set found in 80% Mastery.xls |

|Options Per Item | | | | |

|Scores on a Multiple Choice and |Literacy Test |4999 |ID, Gender, Race, Free and reduced lunch participation, |Randomly generated data |

|Open Response Literacy Exam | | |Performance class, Scaled score, Multiple choice items 1-24, | |

| | | |Multiple choice scores for strands 1-3, Total multiple choice | |

| | | |score, Open ended scores for strands 1-3, Total open ended score, | |

| | | |Total raw score | |

|Simulated Scores for Grades 3-5 on|Arkansas Math |216 |Special services code, Free and reduced price lunch participation,|National Office for Research on Measurement and Evaluation |

|Arkansas Math Benchmark Exam | | |Limited English proficiency classification, Race, Gender, Grade, |Systems (NORMES), University of Arkansas |

| | | |Math proficiency class, Mobility status, Multiple choice score, | |

| | | |Open response score, Total math raw score, Teacher, Multiple | |

| | | |choice and open response scores by 5 math strands (Number Sense, | |

| | | |Geometry, Measurement, Data Analysis, and Patterns and Algebraic | |

| | | |Functions), Total math scaled score | |

|Sleep in Mammals |Animal Sleep |62 |Species of animal, Body weight, Brain weight, Slow wave |Allison, T., & Cicchetti, D. V. (1976). Sleep in mammals: |

| | | |("nondreaming") sleep, Paradoxical ("dreaming") sleep, Total |Ecological and constitutional correlates. Science, 194, |

| | | |sleep, Maximum life span, Gestation time, Predation index, Sleep |732-734. |

| | | |exposure index, Overall danger index | |

|State Expenditures on Education |State Education Expenditures |50 |State, Number of residents per thousand living in urban areas in |Chatterjee, S., & Price, B. (1991). Regression analysis by |

| | | |1970, Per capita expenditure on education projected for 1975, Per |example (2nd ed.). New York: John Wiley. |

| | | |capita income in 1973, Number of residents per thousand under 18 | |

| | | |years of age in 1974, Geographic region | |

|The Return on Stocks in Over the |NYSE OTC |30 |Weekly return of NASDAQ stocks, Weekly return of NYSE stocks |Chatterjee, S., Handcock, M. S., & Simonoff, J. S. (1995). A|

|Counter Market and New York Stock | | | |casebook for a first course in statistics and data analysis.|

|Exchange, May 9-May 13, 1994 | | | |New York: John Wiley. |

|Time of Birth, Sex, and Weight of |Baby Boom |44 |Time of birth, Sex, Birth Weight, Minutes after midnight of birth |Brisbane Sunday Mail, Dec. 21, 1997. Quoted in Dunn, P. |

|44 Babies Born in One Hospital in | | | |(1999). A simple dataset for demonstrating common |

|a 24 Hour Period | | | |distributions. Journal of Statistics Education, 7(3). |

|U.S. Airport Statistics |Airports |135 |Airport, City, Scheduled departures, Performed departures, |U.S. Federal Aviation Administration and Research and |

| | | |Enplaned passengers, Enplaned revenue tons of freight, Enplaned |Special Programs Administration, 'Airport Activity |

| | | |revenue tons of mail |Statistics' (1990). Submitted to the Journal of Statistics |

| | | | |Education by Larry Winner. |

|U.S. Senate Votes for Clinton |Impeachment |100 |Name of Senator, State of Senator, Vote on Article I, Vote on | |

|Removal | | |Article II, Number of votes for guilt, Political party |senvote2.htm, |

| | | |affiliation, Degree of ideological conservatism, Percent of the | |

| | | |vote Clinton received in 1996 in the Senator’s state, Year Senator|w.htm, . Data compiled for the |

| | | |is up for re-election, First-term Senator |Journal of Statistics Education by Alan Reifman. |

|UK Total Monthly Air Passengers, |Air Passengers |612 |Month, Year, Total number of monthly passengers | |

|1949-1999 | | | | |

|Width and Length of Fourth Grade |Kid’s Feet |39 |Birth month, Birth year, Length of longer foot, Width of longer |Meyer, M. C. (2006). Wider shoes for wider feet? Journal of |

|Students’ Feet | | |foot, Gender, Foot measured, Left- or right-handedness |Statistics Education, 14(1). Data collected by the author in|

| | | | |a fourth grade classroom in Ann Arbor, MI. |

|Wind Chill Factor: Windspeed and |Wind Chill |120 |Actual air temperature, Wind speed, Wind chill factor (Variables |National Weather Service; Museum of Science of Boston. |

|Temperature | | |presented in list and matrix format) |Quoted in Chatterjee, S., & Price, B. (1991). Regression |

| | | | |analysis by example (2nd ed.). New York: John Wiley. |

|Yearly Employment Rates in the |Percent Employed |20 |Year, Percent of males employed |The Condition of Education (1991). U.S. Department of |

|U.S. of 25- to 34-Year Old Males | | | |Education. Quoted in Chatterjee, S., Handcock, M. S., & |

|with 9-11 Years of Schooling | | | |Simonoff, J. S. (1995). A casebook for a first course in |

| | | | |statistics and data analysis. New York: John Wiley. |

|Yield (%) on British Short-Term |Government Securities |240 |Year, Yield per month (January-December) | |

|Government Securities in | | | | |

|Successive Months from about 1950 | | | | |

|to about 1971 | | | | |

|Yields from Vineyard Harvest by |Harvest Yield |468 |Harvest year, Row of vines, Yield of grapes |Barnhill family archives, 1976-1991. Quoted in Chatterjee, |

|Row Number and Year of Harvest, | | | |S., Handcock, M. S., & Simonoff, J. S. (1995). A casebook |

|1983-1991 | | | |for a first course in statistics and data analysis. New |

| | | | |York: John Wiley. |

Sample Monte Carlo Simulation Program

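/* Step 1: store the target means, standard deviations, and correlations */
/* for the six SAT-10 scores as a TYPE=CORR data set.                    */
data corr1(type=corr);
   infile cards missover;
   input _type_ $ _name_ $ v1-v6;
   cards;
mean . 668.4 680.2 663.3 668.6 666.2 672.2
std . 39.1 48.8 39.1 37.9 37.6 48.1
n . 15000 15000 15000 15000 15000 15000
corr v1 1.00
corr v2 .91 1.00
corr v3 .96 .78 1.00
corr v4 .71 .65 .68 1.00
corr v5 .69 .64 .66 .95 1.00
corr v6 .64 .57 .62 .93 .77 1.00
;
run;

/* Step 2: extract all six factors from the correlation matrix; */
/* OUTSTAT= saves the results, including the factor pattern.    */
proc factor data=corr1 nfact=6 outstat=t1 noprint;
   var v1-v6;
run;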

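title "Simulation Data for Classroom Models";

/* Step 3: generate correlated scores from the factor pattern. */
proc iml;
start sim1;
   /* Read the PROC FACTOR output statistics into a matrix. */
   use work.t1;
   read all var {v1 v2 v3 v4 v5 v6} into x12;
   n= 10000;                                     /* number of simulated cases */
   x11= {668.4 680.2 663.3 668.6 666.2 672.2};   /* target means */
   xx12= {39.1 48.8 39.1 37.9 37.6 48.1};        /* target standard deviations */
   /* Rows 13-18 of T1 are expected to hold the unrotated factor pattern; */
   /* transposing gives a variables-by-factors loading matrix.            */
   g11= x12[13:18,]`;
   a1= rannor(j(n, 6, 1));                       /* n x 6 independent N(0,1) draws */
   a1_t= t(a1);
   s_hat= g11*a1_t;                              /* impose the target correlations */
   stand= t(s_hat);                              /* n x 6 correlated standard scores */
   /* Rescale each column to its target mean and standard deviation. */
   m1= x11[1,1]; m2= x11[1,2]; m3= x11[1,3]; m4= x11[1,4]; m5= x11[1,5]; m6= x11[1,6];
   s1= xx12[1,1]; s2= xx12[1,2]; s3= xx12[1,3]; s4= xx12[1,4]; s5= xx12[1,5]; s6= xx12[1,6];
   col_g1= m1 + s1*stand[,1]; col_g2= m2 + s2*stand[,2]; col_g3= m3 + s3*stand[,3];
   col_g4= m4 + s4*stand[,4]; col_g5= m5 + s5*stand[,5]; col_g6= m6 + s6*stand[,6];
   n_data= col_g1||col_g2||col_g3||col_g4||col_g5||col_g6;
   /* Write the simulated scores to the data set SIM1_DATA. */
   create sim1_data from n_data[colname= {x1 x2 x3 x4 x5 x6}];
   append from n_data;
finish sim1;
run sim1;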

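/* Step 4: round the simulated scores to whole numbers. */
data sample;
   set sim1_data;
   x1= round(x1, 1); x2= round(x2, 1); x3= round(x3, 1);
   x4= round(x4, 1); x5= round(x5, 1); x6= round(x6, 1);
run;

/* Step 5: check the simulated correlations against the targets. */
proc corr data= sample;
run;

/* Step 6: draw a random sample of 200 cases for classroom use. */
proc surveyselect data=sample sampsize= 200 out= temp1;
run;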
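The logic of the program can be summarized in matrix form. What follows is a sketch of the reasoning rather than part of the original program. The symbols $R$, $V$, $D$, $\Lambda$, $z$, $y$, $\mu_j$, and $\sigma_j$ are introduced here for illustration only. The argument assumes that PROC FACTOR uses its default principal-component extraction and that the target correlation matrix is positive semidefinite.

```latex
% R: target 6 x 6 correlation matrix, with eigenvectors V and eigenvalues D.
% \Lambda: pattern matrix when all six factors are retained (g11 in the program).
\begin{align*}
R &= V D V^{\top}, \qquad \Lambda = V D^{1/2}
   \quad\Longrightarrow\quad \Lambda \Lambda^{\top} = R \\
% z: six independent standard normal draws (one row of a1); y: one row of stand.
y &= \Lambda z, \qquad z \sim N(0, I_6)
   \quad\Longrightarrow\quad \operatorname{Cov}(y) = \Lambda I_6 \Lambda^{\top} = R \\
% Rescale each standardized score to its target mean and standard deviation.
x_j &= \mu_j + \sigma_j\, y_j, \qquad j = 1, \dots, 6
\end{align*}
```

Shifting by $\mu_j$ and rescaling by $\sigma_j$ do not change correlations, so the simulated scores should reproduce the target means, standard deviations, and correlations up to sampling error and the final rounding. The PROC CORR step lets students check this.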

Appendix A

Example Univariate Output for Arkansas Math.xls

The UNIVARIATE Procedure

Variable: s3MtScSc (Mathematics Scaled Score)

Moments

N 216 Sum Weights 216

Mean 223.013889 Sum Observations 48171

Std Deviation 88.260106 Variance 7789.84632

Skewness -0.3278101 Kurtosis -0.1406821

Uncorrected SS 12417619 Corrected SS 1674816.96

Coeff Variation 39.576058 Std Error Mean 6.00533957
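Several entries in the Moments block follow from one another, which gives students a quick hand check of the output. The identities below are standard. The symbols $n$, $\bar{x}$, and $s$ are shorthand introduced here for N, Mean, and Std Deviation, and the results agree with the printed values up to rounding.

```latex
\begin{align*}
n &= 216, \qquad \bar{x} = 223.013889, \qquad s = 88.260106 \\
s^{2} &= 88.260106^{2} \approx 7789.846
   && \text{(Variance)} \\
\mathrm{CSS} &= (n-1)\,s^{2} = 215 \times 7789.84632 \approx 1674816.96
   && \text{(Corrected SS)} \\
\mathrm{SE}(\bar{x}) &= \frac{s}{\sqrt{n}} = \frac{88.260106}{\sqrt{216}} \approx 6.0053
   && \text{(Std Error Mean)} \\
\mathrm{CV} &= 100\,\frac{s}{\bar{x}} = 100 \times \frac{88.260106}{223.013889} \approx 39.576
   && \text{(Coeff Variation)} \\
t &= \frac{\bar{x} - 0}{\mathrm{SE}(\bar{x})} = \frac{223.013889}{6.00533957} \approx 37.136
   && \text{(Student's $t$ for Mu0 = 0)}
\end{align*}
```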

Basic Statistical Measures

Location Variability

Mean 223.0139 Std Deviation 88.26011

Median 226.0000 Variance 7790

Mode 375.0000 Range 375.00000

Interquartile Range 115.50000

Tests for Location: Mu0=0

Test -Statistic- -----p Value------

Student's t t 37.13593 Pr > |t| = |M| = |S| |r| under H0: Rho=0

Total_ Total_

Scaled_ Open_ Multiple_

Score Ended Choice

Scaled_Score 1.00000 0.83332 0.80558

