


A MANUAL

1D and 2D analysis with MatLab for CIV

by Boris Mirkin, DCSIS Birkbeck University of London (2010)

Contents

Introduction . . . . . . . . . . 2

Work with files . . . . . . . . . . 2

Loading and storing

ASCII files: numeric and string

Treating Excel files

Using subsets of entities and/or features

Categorising a quantitative feature

Feature characteristics . . . . . . . . 8

Distribution and histogram

Within group distributions

Mean (the average) and median (the middle of pre-sorted list)

Range and standard deviation

Analysis of numeric 2D distributions . . . . . . 10

Scatterplot; plot and subplot

Regression

Validating results with Bootstrapping . . . . . . 16

Bootstrapping the mean

Pivotal

Non-pivotal

Bootstrapping the linear regression

Appendix 1: Auxiliary programs . . . . . . 21

Appendix 2: Data set stud.dat and list stud.var . . . . 22

Appendix 3: Index of commands and concepts . . . . 23

Introduction

How the working space is organised on the user's computer is up to the user. A recommended option:

- a folder with user-made MatLab codes, called, say, Code, and two or more subfolders, Data and Result, in which data and results, respectively, are to be stored by the user.

The MatLab icon is then clicked on, after which MatLab opens as a three-part window: the part on the right is the working area referred to as the Command Window, and the two parts on the left are auxiliary. MatLab can be pointed to the working directory with traditional MS-DOS or UNIX commands such as cd, typed in the Command Window. MatLab then remembers this path, and it is shown to the user in a small window on top of the Command Window.

MatLab is organised as a set of packages, each in its own directory, consisting of program files with extension .m. The character '%' marks a comment for humans; the comment runs to the end of the line.

Help can be invoked through the Windows menus or within the working area. In the latter case, the "help" command shows the names of the packages as well as of individual program files; the latter are operations that can be executed within MatLab. Example: the command "help" shows a list of packages, "matlab\datafun" among them; the command "help datafun" displays a number of operations such as "max – largest component"; the command "help max" explains the operation in detail.

Work with files

Loading and storing

ASCII files: numeric and string

A numeric data file should be organised as an entity-to-feature data table: rows correspond to entities, columns to features (see stud.dat and stud.var). Such a data structure, with all entries numerical, is referred to as a 2D array, corresponding to what is called a matrix in mathematics; 1D arrays correspond to solitary entities, columns (features) or rows (entity records). The array is the most important MatLab data format for holding numeric data. It works on the principle of a chess-board: its (i,k)-th entry arr(i,k) is the element in the i-th row and k-th column. An Excel file has a similar structure, but it is interlaced with strings. A 2D array's defining feature is that every row has the same number of columns.
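For instance, a tiny 2D array can be entered row by row, with rows separated by semicolons (a toy illustration, not part of the student data):

>> m=[1 2 3; 4 5 6]; % a 2x3 array: two rows, three columns

>> m(2,1) % returns 4, the element in row 2, column 1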

To load such a file one may use a command from package "iofun". A simple one is "load" to load a numeric array, organised as described, into the current MatLab processor memory:

>> arr=load('Data\stud.dat');

% the symbol "%" is used for comments: the MatLab interpreter ignores everything from "%" to the end of the line.

% "arr" is a place to put the data (variable);

% a semicolon ";" should stand at the end of an instruction; if it does not, the result is printed to the screen, which can be very useful for checking the course of a computation

% stud.dat is a 100x8 file of 100 part-time students with 8 features:

% 3 binary columns for Occupation, then Age, NumberChildren, and scores over three disciplines. All feature

% names are in file stud.var stored in the Data folder.

A 1D array can be put into the workspace with a command like

>>a=[3 4 7 0];

which is a 1x4 (row) array; it can be transposed into a 4x1 (column) array with the transpose operator

>> b=a'

Since no semicolon is put in the end, b will be displayed on screen as

3

4

7

0

To get its 2nd entry, a command

>> c=b(2)

can be utilised. Similarly, command

>> d=arr(7,8)

puts the value in arr’s 7th row and 8th column into workspace as variable d.
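In the same vein (row7 and col8 are names chosen here purely for illustration), the colon operator picks out a whole row or column:

>> row7=arr(7,:); % the entire 7th row: all 8 features of student 7

>> col8=arr(:,8); % the entire 8th column: the CI scores of all 100 students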

If a numeric array in working memory is to be stored, one may use MatLab command “save” which admits a number of storage formats including internal .mat format (see more with “help save”). To store array X into file Result\good.res in ASCII format (which is a text format covering characters in a standard keypad set), one may use command

>> save Result\good.res X -ascii

Commands

>> Var=importdata('Data\stud.var');

>> Ma=importdata('Data\stud.dat');

will put the list of variables into workspace as 8x1 cell Var, and the data into 100x8 array Ma.

To store Ma back as an ASCII table in file Maf.dat in the Data folder, one may use the command

>> dlmwrite('Data\Maf.dat', Ma ,'delimiter',' ','precision','%3.2f');

in which each of the four items: file name ('Data\Maf.dat'), variable name (Ma), delimiter (' '), and precision ('%3.2f' – a real number printed with 2 digits after the decimal point) – is changeable.

I put a couple of home-made string-handling programs into Appendix 1 so that they can be copied into the Code directory. There are more recent commands in MatLab, textread and textscan, that are much more versatile tools for importing text.
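For example, a minimal textscan-based reading of the feature names might look as follows (vnames is a name chosen here for illustration):

>> fid=fopen('Data\stud.var','rt');

>> vnames=textscan(fid,'%s','Delimiter','\n'); % one string per line of the file

>> fclose(fid);

>> vnames=vnames{1}; % textscan wraps its result in an outer cell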

If you need to check, before saving, what files and variables are currently in the workspace, you may use the upper-left part of the MatLab window or command

>> whos

that produces the list on the screen.

Names are handled as strings, delimited with the ' symbol. The entity/feature name lengths may vary and thus cannot be handled in the array format.

To do this, another data format is used: the cell. Cells involve curly braces rather than round braces (parentheses) utilised for arrays. See the difference: arr(i) is 1D array arr’s i-th element, whereas brr{i} is cell brr’s i-th element, which can be not only a number, as in arrays, but also a string, an array, or even another cell.
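A small illustration of the difference (brr is a name used here for illustration only):

>> brr={'Age', [3 4 7 0], 99}; % a cell holding a string, an array and a number

>> brr{2} % curly braces return the contents, here the array [3 4 7 0]

>> brr(2) % round braces return a 1x1 sub-cell containing that array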

There are other data structures as well in MatLab (video, audio, internet) which are not covered in this manual.

To load a sequence of names, one can use an in-house MatLab code readlist.m, supplied in the Appendix 1 for doing this. It should be copied as is into MatLab editor and stored as “readlist.m” file in the working folder, Code, in our example.

>> b=readlist('Data\stud.var');

% to copy list of names of stud.dat features from Data folder into cell b in the working memory;

Treating Excel files

MatLab supports several data formats, including those from popular tools such as Excel, which is popular among scientists and practitioners alike (see more in help iofun). An Excel file with extension .xls can be dealt with in MatLab by using the commands xlsread and xlswrite. Straightforward as they are, the user should not expect a comfortable switch between Excel and MatLab with these commands.
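For instance, writing a numeric array back to a spreadsheet could be sketched as follows (the array A and the output file name are chosen here purely for illustration):

>> A=[35 0 94; 28 2 67]; % any numeric array in the workspace

>> xlswrite('Data\student_out.xls', A); % writes A into the first sheet of the file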

Take a look, for example, at the Excel data file of several students in the table below.

Feature     Age    #Children   Occup   ML_Mark
Student
John        35     0           IT      94
Peggy       28     2           BA      67
Fred        27     1           BA      85
Chris       28     0           OT      48
Liz         25     0           IT      87

Table 1. An Excel spreadsheet with data of five students over four features (Age in years, Number of Children, Occupation [Information Technology or Business Administration or Other], and Mark in MatLab over a range of 0-100%).

The xlsread command produces three data structures from an xls file: one for numeric part, the other for text part, and the third for all data in the file. Specifically, if the table above is stored in Data subfolder as student.xls file, this works as follows:

>> [nn,tt,rr]=xlsread('Data\student.xls');

% nn is array of numeric values, tt – is cell of text, and rr is cell covering all the data in file

to produce a numeric 5x4 array nn:

nn =

    35     0   NaN    94
    28     2   NaN    67
    27     1   NaN    85
    28     0   NaN    48
    25     0   NaN    87

and a text 8x5 cell tt:

tt =

    'Feature'   'Age'   '#Children'   'Occup'   'ML_Mark'
    'Student'   ''      ''            ''        ''
    ''          ''      ''            ''        ''
    'John'      ''      ''            'IT'      ''
    'Peggy'     ''      ''            'BA'      ''
    'Fred'      ''      ''            'BA'      ''
    'Chris'     ''      ''            'OT'      ''
    'Liz'       ''      ''            'IT'      ''

The full dataset is in the 8x5 cell rr:

rr =

    'Feature'   'Age'    '#Children'   'Occup'   'ML_Mark'
    'Student'   [NaN]    [NaN]         [NaN]     [NaN]
    [NaN]       [NaN]    [NaN]         [NaN]     [NaN]
    'John'      [ 35]    [  0]         'IT'      [ 94]
    'Peggy'     [ 28]    [  2]         'BA'      [ 67]
    'Fred'      [ 27]    [  1]         'BA'      [ 85]
    'Chris'     [ 28]    [  0]         'OT'      [ 48]
    'Liz'       [ 25]    [  0]         'IT'      [ 87]

The NaN symbol applies in MatLab to undefined numeric values such as those emerging from division by zero and the like.
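As an aside, undefined entries can be detected with the built-in function isnan; a minimal sketch over the array nn obtained above:

>> isnan(nn(1,3)) % returns 1 (true): the (1,3) entry of nn is NaN

>> sum(isnan(nn)) % column-wise counts of NaN entries in nn; here [0 0 5 0]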

As one can see, these are not exactly clean-cut structures to work with. The numeric array nn contains an incomprehensible column of NaN values, and the text cell tt mixes up names of students and features. We are going to deal with this in the next section.

Using subsets of entities and/or features

If one wants to work with only three of the features, say "Age", "Children" and "OOProgramming_Score", one must put together their indices into a named 1D array:

>> ii=[4 5 7]

% no semicolon in the end to display ii on screen as a row;

Then the commands to reduce the dataset and the set of feature names over the ii columns are:

>> newa=arr(:,ii); %new data array

>> newb=b(ii);

% newb is the new feature set; note that round braces are used to index it rather than curly ones,

% even though b is a cell, not an array: indexing a cell with round braces returns a sub-cell

A similar command extracts a subset of entities. If, for instance, we want to limit our attention to only those students who received 60 or more at "OOProgramming", we first find their indices with the command "find":

>> jj=find(arr(:,7)>=60);

% jj is the set of the students defined in find()

% arr(:,7) is the seventh column of arr

Now we can restrict "arr" to the rows in "jj":

>> al=arr(jj,:); % partial data of better off students

The size of the data file al can be found with command

>>size(al)

% note: no semicolon to see the size on the screen

to produce a screen output:

ans =

55     8

meaning that al consists of 55 rows and 8 columns. If one needs to keep these values in the workspace, use the command

>>[n,m]=size(al)

that will put 55 into n and 8 into m.

Now we are ready to discern meaningful data from the numeric array nn and text cell tt obtained for Table 1 in the previous section. As shown there, array nn's meaningless column is column 3. Thus we can answer the question

Q1. Remove column 3 from nn.

Generally, a solution should be like this:

>> [rnn,cnn]=size(nn);

% thus, the number of columns is cnn

>> vv=setdiff([1:cnn],3);

% operation setdiff(x,y) removes from x all elements of array y occurring in x

% [1:cnn] is an array of all integers from 1 to cnn inclusive, so that [1:4] is [1 2 3 4]

% thus, vv consists of all indices but 3

>> nnr=nn(:,vv);

% this puts all nn, except for column 3, into nnr:

35 0 94

28 2 67

nnr = 27 1 85

28 0 48

25 0 87

Q2. Create a cell containing the corresponding feature set.

We need first to have a cell with all the feature names. These constitute the final fragment of the first row of cell tt, without the very first string, "Feature", as can be seen from the tt contents shown above. Thus the command

>> fe=tt(1,2:5);

% only contents of the first row in tt concerning its four columns, 2 to 5, goes to cell fe

leads to cell fe of size 1x4 containing the four feature names. To remove feature 3, we apply the array vv produced in the answer to Q1:

>>fer=fe(vv);

Cell fer contains the strings 'Age', '#Children', 'ML_Mark' indexed by 1, 2 and 3 and corresponding, respectively, to the columns of array nnr.

Categorising a quantitative feature

Features can be categorical and quantitative. A quantitative feature can be categorised. Take, for example, the "OOProgramming" score and postulate that its range should be divided into 4 categories:

1 - those whose mark is less than 40;

2 - the mark is between 40 and 50 inclusive;

3 - those whose mark is between 51 and 74 inclusive, and

4 - those with mark 75 or greater

The categorised variable is to be put as an additional, 9th, column in "arr".

This can be done by defining an array of all boundary points:

>> bound=[0 39 50 75 101];

% note that the minimum and maximum marks are also included

with the following instruction in a loop “for”:

>> c=4; nc=9;

>> for k=1:c; ll{k}=find(arr(:,7)>bound(k) & arr(:,7)<=bound(k+1)); arr(ll{k},nc)=k; end;

% ll{k} holds the indices of the students falling into category k;

% column nc=9 of arr receives the category label k for those students

Feature characteristics

Distribution and histogram

The distribution of a feature, for instance the OOP score in column 7, can be counted with the commands "hist" and "histc":

>> c=5; hh=hist(arr(:,7), c);

% produces counts in equally sized c=5 bins

>> c=4; hc=histc(arr(:,7),bound(1:c));

% produces counts within categories delineated in array "bound" above

Calling hist with no output arguments creates a drawing of the distribution in the so-called histogram format (see Figure 1). To put side by side the histograms for the equal-bin categorisation and for the categories drawn in array bound, one may use the "subplot(k,m,n)" command that creates a k times m panel of boxes and puts the current drawing into the n-th box, counting row-wise from the top:

>> subplot(1,2,1);hist(arr(:,7));axis([30 100 0 20]);

>> subplot(1,2,2);hist(arr(:,7),bound);axis([0 100 0 50]);

The axis commands at the end of each line set the box frames, to produce the drawings in Figure 1.

[pic]

Figure 1. Two histograms for the OOP marks feature: the one on the left presents counts in 10 equal-sized bins supplied by default by the command hist, and the one on the right presents the bins defined by array bound.

There is also a similar command “pie” for producing pie-charts of distributions.
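A minimal sketch of its use with the category counts hc computed above with histc:

>> pie(hc); % a pie-chart of the counts in hc, one sector per category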

Within-group distribution

[pic]

Figure 2: Distribution of students over four score groups within each occupation category.

Let us visualise distributions of SoftEngineering scores within student occupation groups, IT, BA and AN, that are coded by 1 in columns 1, 2 and 3, respectively, of the data array "arr".

>> clear hh; for ii=1:3 % loop over the occupation categories; hh is cleared first to avoid a size clash with the earlier hh

>> list=find(arr(:,ii)==1);

>> hh(:,ii)=histc(arr(list,6),bound); % SoftEngineering scores are in column 6

>> end

>> hh=hh(1:4,:); %Removal of the meaningless 5th line corresponding to bound(5)=101

Command

>> plot(1:4,hh); % 1:4 are for x-axis, hh for y-axis

produces a figure looking, after editing, as Figure 2, whereas command

>> bar(hh);

produces a figure looking, after editing, as Figure 3

Centre: Mean (the average) and Median (the middle of pre-sorted list)

For a 2D array, mean scores of all columns can be found in one go:

>> aver=mean(arr);

% This produces a 1x8 array of within-column means;

% means in columns 1, 2 and 3 corresponding to occupation categories are not meaningless:

% they express the proportions of categories IT, BA, and AN in the sample

Command

>> mmed=median(arr)

produces the within-column median values (the middle value of each column after sorting)

[pic]

Figure 3: Different distributions of score groups within each occupation category.

Spread: Range and Standard deviation

Range is the difference between maximum and minimum values of a variable. It can be computed for all columns simultaneously:

>> bmin=min(arr); % within-column minimum values

>> bmax=max(arr); % within-column maximum values

>> range=bmax-bmin; % a row of within-column ranges

Standard deviation is the quadratic root of the variance, which is the average square of deviations of the feature values from the mean. It also can be computed column-wise:

>> sdev=std(arr);

% or sdev=std(arr,1) if you want N in the denominator of the variance

Analysis of 2D distributions (numeric case)

Bivariate distributions show the relationship between two features. There can be different cases depending on the types of measurement scale: (a) both features are quantitative, (b) both features are categorical, (c) one is categorical and the other quantitative, of which only (a) is presented here.

Scatterplot

Plot and subplot

Given a numeric data table, like "arr" the student-to-feature 100 x 8 array, take any two quantitative features of interest and plot entities as points on the plane formed by the features. For instance, take "Age" as x and "Computational Intelligence" score as y:

>> x=arr(:,4); % Age is 4-th column of array "arr"

>> y=arr(:,8); % CI score is in 8-th column of "arr"

Then student 1 (first row) will be represented by a point with coordinates x=28 and y=90 to represent the student’s age and CI score, respectively.

A display referring to all entities as points on a coordinate plane is referred to as a scatterplot. To make one, use command:

>> plot(x,y,'k.')

% k refers to black colour, “.” dot graphics; 'mp' would stand for magenta pentagram; see others by

% using "help plot"

Unfortunately, this gives a very tight presentation: some points lie on the borders. To stretch the borders out, one needs to change the axes:

>> d=axis;

>> axis(1.2*d-10);

This transformation is presented on the right of Figure 4. To make both plots presented on the same figure, use "subplot" command:

>> subplot(1,2,1)

>> plot(x,y,'k.');

>> subplot(1,2,2)

>> plot(x,y,'k.');

>> d=axis;

>> axis(1.2*d-10);

Command subplot(1,2,1) creates one row consisting of two windows for plots and puts the follow-up plot into the 1st window (that on the left).

Command subplot(3,2,1) would create a set of 6 plot boxes, 3 rows by 2 columns, and put the current plot into the first of them.

A more explicit setting for the frame can be achieved with a command

>>axis([a b c d]);

that sets the x-axis box frame from a to b, and that of y-axis from c to d.

Whichever presentation is taken, no regularity in relation between the two variables can be seen at all. Let's try to see whether anything better can be seen for different occupations. To do this, one needs to handle entity sets for them separately:

[pic]

Figure 4: Scatterplot of features “Age” and “CI score”; the display on the right is a rescaled version of that on the left.

>> o1=find(arr(:,1)==1); % set of indices for IT

>> o2=find(arr(:,2)==1); % set of indices for BA

>> o3=find(arr(:,3)==1); % set of indices for AN

>> x1=x(o1);y1=y(o1); % the features x and y at IT students

>> x2=x(o2);y2=y(o2); % the features at BA students

>> x3=x(o3);y3=y(o3); % the features at AN students

Now we are in a position to put, first, all the three together, and then each of these three separately (again with the command "subplot", but this time with four windows, see Figure 5).

>> subplot(2,2,1);

>> plot(x1,y1, '*b',x2,y2,'pm',x3,y3,'.k');% all the three plotted

>> d=axis; axis(1.2*d-10);

>> subplot(2,2,2);

>> plot(x1,y1, '*b'); % IT plotted with blue stars

>> d=axis; axis(1.2*d-10);

>> subplot(2,2,3);

>> plot(x2,y2,'pm'); % BA plotted with magenta pentagrams

>> d=axis; axis(1.2*d-10);

>> subplot(2,2,4);

>> plot(x3,y3,'.k'); % AN plotted with black dots

>> d=axis; axis(1.2*d-10);

Of the three occupation groups, Figure 5 shows some potential in the AN group only: it is likely that "the greater the age the lower the mark" in this group (black dots in the bottom-right panel of Figure 5). To check this, let us utilise the linear regression.

[pic]

Figure 5: Joint and individual displays of the scatter-plots for the occupation categories (IT blue, BA magenta, AN black).

There is also command “scatter” to allow variable sizes and/or colours of the graphic symbols representing individual entities (see “help scatter”).
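A hedged one-line sketch (the marker size 20 is an arbitrary choice here):

>> scatter(x,y,20,'k','filled'); % filled black markers of size 20 at the (x,y) points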

Plotmatrix

[pic]

Figure 6. Marks on SE, OOP and CI (from top to bottom) against Age, on the left, and Number of children, on the right.

The operation plotmatrix allows one to see scatterplots of one set of variables against another, or the same, set of variables defined on the same entities. Consider, for example, the relation between demographics (age and number of children) and marks on three subjects according to the stud.dat file that is held in the workspace as array arr. Define the corresponding matrices

>> x=arr(:,4:5); % demographics are in columns 4 and 5 of arr array

>> y=arr(:,6:8); % marks are in columns 6,7 and 8 of arr array

and set

>> plotmatrix(x,y,'k*');

to produce a box presented on Figure 6.

The plots show no relation between demographics and marks. We also can take a look at interrelation between marks in different subjects with

>> figure(2); plotmatrix(y,y,'k*'); % this will be mapped into a different Figure window

leading to a set of plots presented on Figure 7. This set shows no relation between SE marks and marks in other subjects, and a possible positive relation between marks in OOP and CI. These relations can be quantitatively evaluated by computing pairwise correlation coefficients with command:

>>corrcoef(y)

to display

ans =

1.0000 0.0716 -0.0286

0.0716 1.0000 0.3108

-0.0286 0.3108 1.0000

showing an almost zero correlation between SE and the others, and a positive, however small, correlation of .31 between OOP and CI. An explanation of the meaning of the coefficient is given in the Regression section below.

Regression

Regression is a technique invented by F. Galton and K. Pearson to see whether the relationship between x and y can be expressed as a "linear function" (a straight line on the plot), y = slope*x + intercept, where slope and intercept are constants, the former expressing the change in y when x is increased by 1 and the latter the level of y at x=0. The best possible (that is, minimising the average square difference between real y's and those found as slope*x+intercept) values of slope and intercept are expressed as follows:

>> slope = rho*std(y)/std(x); intercept = mean(y) - slope*mean(x);

Here "rho" is the so-called (Pearson) correlation coefficient between x and y that can be determined with MatLab operation "corrcoef".

[pic]

Figure 7. Marks in SE, OOP and CI (from top to bottom) against marks on the same subjects, from left to right.

The correlation coefficient has several interpretations depending on the framework in which the data of x and y are considered. The classical mathematical statistics view is that the observations are independently generated according to a two-dimensional probability distribution. A special case is when the distribution is Gaussian, which means that its density function has a bell shape, so that its level (indifference) curves are ellipses. Then "rho" expresses the direction of the principal axis of the ellipses. For another distribution shape, the correlation coefficient is meaningful only to the extent to which the distribution resembles a Gaussian one. The author adheres to another view, which has nothing to do with Gaussians and is expressed in newer paradigms of data analysis and data mining. This view relates "rho" to the extent to which the relation between x and y can be expressed with a linear equation. Then the square rho^2 expresses the proportion of y's scatter which is taken into account by such a linear equation – the closer to 1 the better – while the sign of "rho" shows the direction of the relation, positive or negative.

Since we are interested in group AN only, we take AN-related values only, x3 and y3:

>> cc=corrcoef(x3,y3)

leading to table cc =

1.0000 -0.7082

-0.7082 1.0000

in which rho is the off-diagonal entry, which can be picked up with

>> rho=cc(1,2);

Then the general formula applies to the pair (x3,y3):

>> slope = rho*std(y3)/std(x3); % this produces slope =-1.33;

>> intercept = mean(y3) - slope*mean(x3); % this produces intercept = 98.2;

thus leading to the linear regression y3= 98.2 - 1.33*x3 expressing the idea that every year added to the age, in general, decreases the mark by 1.33, so that aging by 3 years would lead to the loss of 4 marks.
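As an independent cross-check (not part of the derivation above, and assuming x3 and y3 are still in the workspace), MatLab's polyfit fits the same least-squares line:

>> p=polyfit(x3,y3,1); % p(1) is the slope and p(2) the intercept of the degree-1 fit

and should return values close to -1.33 and 98.2.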

To check whether the equation is good, one may compare the real values for students number 80, 81 and 82, all belonging to AN, with those derived using the equation:

>> ii=[80 81 82];

>> x(ii); %the ages

>> y(ii); %the marks

>> yy=slope*x(ii)+intercept; % the marks derived from the age

thus yielding the following data:

x y yy

24 62 66.3

34 30 53.0

41 39 43.7

One can see that the error for the second student, 53 (predicted) – 30 (real) = 23, is rather high, which reflects the fact that the mark of this student contradicts the general regularity: s/he is younger than the third student but her/his mark is lower.

Altogether, the regression equation explains rho^2 = 0.50 = 50% of the total variance of y3. The rho squared is referred to as the coefficient of determination in mathematical statistics.

Validating results with Bootstrapping

Bootstrapping is a popular computational experiment technique to validate data analysis results. The computational power allows for experimentation on the spot, with real data, rather than with theoretical probabilistic distributions, not necessarily adequate to the data, as in the conventional approach of mathematical statistics. (Those interested to learn more should consult J. Carpenter and J. Bithell (2000) Bootstrap confidence intervals: when, which, what?, Statistics in Medicine, 19, 1141-1164.)

Bootstrapping is based on a pre-specified number of random trials. Each trial begins with randomly selecting an entity as many times as there are entities in the dataset, with replacement, so that the same entity can be selected several times whereas some other entities may never be selected in a trial. (In fact, in each trial, on average only about 62% of the entities get selected into the sample.) In the case of the data of AN students, the sample consists of N=31 entities because this is the number of students in the set under consideration. In general, if the set contains N entities, a trial sample is created of N randomly selected – possibly with repetitions – entities from the set. The selected set of N students (some of them, as explained, may coincide) is assigned the corresponding data values from the original data table, so that coinciding students get identical feature values.

To do this in MatLab, the following operation is applied:

>> ra=ceil(n*rand(n,1));

% rand(n,1) is a 1D array of n pseudorandom numbers drawn by the computer between 0 and 1; multiplied

% by n, these numbers are between 0 and n, exclusive. The operation ceil(aa) rounds aa to the nearest integer

% from above. Thus ra is an array of pseudorandom integers between 1 and n inclusive – a bootstrap trial sample.

>> xr=x(ra);

% this leads to array xr of length n translating values of x to corresponding elements in ra. It is

% assumed that x is the n-element data set under investigation.

Then a data analysis method under consideration applies to this data sample to produce the trial result. After a number of such trials, say a thousand or five thousand, the user gets enough data to see how well they correspond to the original results, especially with respect to their standard deviation from them.

In the remainder we apply this approach to estimate the validity of the mean OOP score, mean(arr(:,7)) = 61.65 over the entire data set, and the linear regression of CI score over Age in AN occupation group.

Bootstrapping the mean

Consider array

>>oop=arr(:,7); % OOP score 1D array of length 100

Its mean is mo=61.65. The question is: how much should one trust this value? Classical statistical thinking suggests considering the array as a set of independent values randomly drawn from the same distribution; then the standard deviation has something to do with the validity of the mean, and it is quite easy to compute with MatLab:

>>so=std(oop);

This gives us the standard deviation value so=16.52. In the case that the distribution is Gaussian, that is, bell-shaped, that would give us an estimate mo ± 1.96*so/10 = 61.65 ± 3.24, which is the interval from 58.41 to 64.89, so that the "real" mean should fall in this interval with 95% confidence. We divide "so" by 10, which is the square root of the number of observations, 100, according to the mathematical statistics formula for the sampling estimate of the standard deviation of the mean. The factor f=1.96 comes from the theory of Gaussian distributions to guarantee that 95% of the population falls in the interval mo ± f*so/10. However, the distribution is hardly Gaussian, as can be clearly seen from the rather irregular histogram shape on the left in Figure 1 (the histogram on the right is irrelevant – why?). Is there any estimate that can be convincingly drawn in such a case?
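The classical interval above can be computed directly; a small sketch of the same arithmetic:

>> mo=mean(oop); so=std(oop);

>> ci=[mo-1.96*so/sqrt(100), mo+1.96*so/sqrt(100)] % approximately [58.41, 64.89]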

Let us draw 5000 bootstrap trials for the array oop:

>> n=100; for k=1:5000; ra=ceil(n*rand(n,1)); oopr=oop(ra); mr(k)=mean(oopr); end;

Here, n=100 is the size of oop, and mr is the 5000-strong array of the means of the 5000 trials explained above. The histogram of mr, say in 40 bins, can be drawn with the command

>>hist(mr, 40);

to produce the drawing in Figure 8.

[pic]

Figure 8. The 40-bin histogram for the bootstrap 5000 means over OOP score.

The distribution of bootstrap means in Figure 8 looks fairly bell-shaped. Thus the confidence interval for mo can be estimated by using bootstrap results in two ways:

A. Pivotal method

Here we utilise the estimates for mo and so from the bootstrap distribution:

>> mb=mean(mr); sb=std(mr);

which leads to mb=61.67 and sb=1.66 (rounded to two decimal digits). Using these, we find the 95% confidence interval with the Gaussian-based formula mb ± 1.96*sb = 61.67 ± 1.96*1.66 = [58.42, 64.92], which is just the Gaussian result shifted to the right by about 0.01.

B. Non-pivotal method

Here we use the mr distribution as is and take the boundaries of the interval containing 95% of the bootstrap trial values. To do this, we take, as usual, 2.5% of 5000 values from each end of the distribution. That means that we need to sort array mr in the ascending order and remove 125 values from each end so that 126th and 4875th values in the sorted array will be the boundaries of a bootstrap 95% confidence interval.

Sorting in the ascending order is done in MatLab with “sort” command:

>>ms=sort(mr);

after which ms(126) = 58.48 and ms(4875)=64.95 are easily produced. The interval found, [58.48, 64.95] shifts the Gaussian estimate for the confidence interval to the right by about 0.06.
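The cut-off indices 126 and 4875 can also be computed rather than counted by hand; a sketch assuming 5000 trials and a 95% level:

>> ntr=5000; alpha=0.05;

>> lo=floor(alpha/2*ntr)+1; hi=ntr-floor(alpha/2*ntr); % lo=126, hi=4875

>> disp([ms(lo) ms(hi)]); % the non-pivotal 95% confidence interval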

The author prefers the non-pivotal estimate as based purely on computation and more intuitive.

Bootstrapping linear regression

To do a trial with randomly selecting N=31 students from AN occupation category, one can use the following MatLab command:

>> N=31;ra=ceil(N*rand(N,1));

Command rand(N,1) produces a column of N=31 random real numbers, between 0 and 1 each. Multiplying this by N=31 stretches the numbers to be between 0 and N, and "ceil" rounds them up to integers.

Then corresponding values of x and y can be found by using equations:

>>xr=x3(ra);yr=y3(ra);

after which formulas above apply to compute the rho, slope and intercept.

To do this a number (5000, in this case) of times, one runs a loop:

>> n=31;for k=1:5000; ra=ceil(n*rand(n,1));

xr=x3(ra);yr=y3(ra);

cc=corrcoef(xr,yr);rhr(k)=cc(1,2);

sl(k)=rhr(k)*std(yr)/std(xr);

in(k)=mean(yr)-sl(k)*mean(xr);

end

% a "for" loop runs with counter k through every single integer from 1 to 5000 inclusive;

% ra represents a bootstrap trial sample, with xr, yr being corresponding variables’ values;

% the results are stored in 5000-strong columns

% rhr (correlation coefficients), sl (slopes) and in (intercepts)

Now we can check the mean and standard deviation of the obtained distributions. Commands

>>mean(sl)

>>std(sl)

produce values -1.33 and 0.24, which fits the previous result well but supplements it with a value of the standard deviation. Similarly mean and std values for the intercept and rho are produced. They are, respectively, 98.2 / 9.0 and -0.704 / 0.095. These are sufficient to derive a confidence interval for each of the parameters with the pivotal method using the Gaussian formula analogous to that used for validating the mean in the previous section, which is left to the reader.
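For instance, a pivotal 95% interval for the slope could be sketched as:

>> disp([mean(sl)-1.96*std(sl), mean(sl)+1.96*std(sl)]); % roughly [-1.80, -0.86] with the values above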

Figure 9 presents the histograms of the bootstrap slope and intercept values as rather bell-shaped indeed. They are obtained with:

>> subplot(1,2,1); hist(sl,30);

>> subplot(1,2,2); hist(in,30);

To use the non-pivotal method for deriving bootstrap based confidence intervals for the slope and intercept, one needs to sort these in the ascending order:

>>sls=sort(sl);ins=sort(in);

Consider the 95% confidence interval with boundaries leaving out 125 entries from each side:

>> disp([sls(126) sls(4875)]);

>> disp([ins(126) ins(4875)]);

to print on screen -1.7962 -0.8645 and 80.6003 115.7490, respectively.

[pic]

Figure 9: 30-bin histograms of the slope (left) and intercept (right) after 5000 bootstrapping trials.

All this can be put into the same Figure 10 by, first, picking out the boundary values

>> slleftbound=sls(126); slrightbound=sls(4875); inleftbound=ins(126); inrightbound=ins(4875);

then defining the three regression lines with

>> y3reg=slope*x3+intercept;

>> y3regleft=slleftbound*x3+inleftbound;

>> y3regright=slrightbound*x3+inrightbound;

and then plotting the four sets onto the same figure:

>> plot(x3,y3,'*k',x3,y3reg,'k',x3,y3regleft,'r',x3,y3regright,'r')

% x3,y3,'*k' presents student data as black stars

% x3,y3reg,'k' presents the real regression line in black

% x3,y3regleft,'r' and x3,y3regright,'r' present the boundary regressions with red lines

The red lines on Figure 10 show the limits of the regression line for 95% of trials as estimated using the non-pivotal bootstrap based method.

[pic]

Figure 10: Regression of CI score over Age (black line) within occupation category AN with boundaries covering 95% of potential biases due to sample fluctuations.

Appendix 1

Code of readlist.m

%reading names as a cell structure, E, into workspace
function E=readlist(Filename)
read=1;
i=0;
fid=fopen(Filename,'rt');                        % open the file for reading in text mode
if fid~=-1
    while ~(feof(fid)==1)                        % until the end of the file is reached
        a=fgetl(fid);                            % read the next line as a string
        if (strncmp('%',a,1)==0 & isempty(a)==0) % skip comment and empty lines
            i=i+1;
            E{i}=a;                              % store the line as the i-th element of cell E
        else
            read=0;
        end;
    end;
    fclose(fid);
else
    disp('The file has not been opened');
end;
return

Code of savelist.m

% Saves a cell list to file Data\fle.txt where "fle" is the user-given prefix of the file name
function savelist(fle, list)
file=['Data\' fle '.txt'];                       % build the full file name
fid=fopen(file,'w');                             % open the file for writing
if fid~=-1
    [n1,n2]=size(list);
    for j=1:n2                                   % write each list element on its own line
        fprintf(fid, '%s\n', list{j});
    end;
    fprintf(fid, '\n');
    fclose(fid);
end;
return

Appendix 2. Data set stud.dat (each line holds two student records; the first three columns of a record are written without separating spaces) and list stud.var

1 2 3 4 5 6 7 8    1 2 3 4 5 6 7 8    Column number

100 28 0 41 66 90 010 51 2 75 73 57

100 35 0 57 56 60 010 44 3 53 43 60

100 25 0 61 72 79 010 49 3 86 39 62

100 29 1 69 73 72 010 27 2 93 58 62

100 39 0 63 52 88 010 30 1 75 74 70

100 34 0 62 83 80 010 47 0 46 36 36

100 24 0 53 86 60 010 38 2 86 70 47

100 37 1 59 65 69 010 49 1 76 36 66

100 33 1 64 64 58 010 45 0 80 56 47

100 23 1 43 85 90 010 44 2 50 43 72

100 24 1 68 89 65 010 36 3 66 64 62

100 32 0 67 98 53 010 31 2 64 45 38

100 33 0 58 74 81 010 31 3 53 72 38

100 27 1 48 94 87 010 32 3 87 40 35

100 32 1 66 73 62 010 38 0 87 56 44

100 29 0 55 90 61 010 48 1 68 71 56

100 21 0 62 91 88 010 39 2 93 73 53

100 21 0 53 59 56 010 47 1 52 48 63

100 26 1 69 70 89 010 39 2 88 52 58

100 20 1 42 76 79 001 23 0 54 50 40

100 28 1 57 85 85 001 34 0 46 33 59

100 34 1 49 78 59 001 33 0 51 38 67

100 22 0 66 73 69 001 31 0 59 45 68

100 21 1 50 72 54 001 25 0 51 41 74

100 32 1 60 55 85 001 40 0 41 61 45

100 32 0 42 72 73 001 41 0 44 43 53

100 20 1 51 69 64 001 42 0 40 56 50

100 20 1 55 66 66 001 34 0 47 69 49

100 24 1 53 92 86 001 37 0 45 50 43

100 32 0 57 87 66 001 24 0 47 68 62

100 21 1 58 97 54 001 34 0 50 63 30

100 27 1 43 78 59 001 41 0 37 67 39

100 33 0 67 52 53 001 47 1 43 35 36

100 34 1 63 80 74 001 28 0 50 62 66

100 34 0 64 90 56 001 28 0 39 66 82

010 36 2 86 54 68 001 46 0 51 36 30

010 35 2 79 72 60 001 27 0 41 35 72

010 36 1 55 44 57 001 44 0 50 61 50

010 37 1 59 69 45 001 47 0 48 59 48

010 42 2 76 61 68 001 27 0 47 56 47

010 30 3 72 71 46 001 27 0 49 60 65

010 28 1 48 55 65 001 21 0 59 57 74

010 38 1 49 75 61 001 22 0 44 65 54

010 49 2 59 50 44 001 39 0 45 41 34

010 50 2 65 56 59 001 26 0 43 47 80

010 34 2 69 42 59 001 45 1 45 39 35

010 31 2 90 55 61 001 25 0 42 31 71

010 49 3 75 52 42 001 25 0 45 33 66

010 33 1 61 61 60 001 50 1 48 64 20

010 43 0 69 62 42 001 33 0 53 44 45

Appendix 3

Index of commands and concepts explained in the manual

Concept Page

Array 2

Axis 11

Bar 9

Bootstrap 16

--, pivotal 18

--, non-pivotal 18

--, trial 16

Categorisation 7

Ceil 17

Cell 4

Comment 2

Correlation coefficient 14

Distribution 8

Dlmwrite 3

Find 6

Gaussian distribution 17

Help 2

Hist 7

Histc 8

Histogram 8

Importdata 3

Intercept 14

Load 2

Loop 7

Mean 9

Median 9

Path 2

Plot 8, 11

Plotmatrix 13

Range 10

Reading Excel files 4

Regression 14

Save 3

Scatterplot 11

Semicolon 2

Setdiff 6

Size 6

Slope 14

Sort 18

Standard deviation 10

String 4

Subplot 8, 11

Subset 5

Transpose 3

Variance 10

Whos 4

-----------------------

Features in dataset stud.dat (corresponding to columns):

1. Occup:IT

2. Occup:BA

3. Occup:AN

4. Age

5. Children

6. SoftEngineering

7. OOProgramming

8. CompIntelligence
