




DIY with MatLab

10 and 17 November 2006

Login csish50 + your number in the register (Example: csish63 if number =13)

password N03atinLab

MatLab: A programming environment for user-friendly and fast manipulation and analysis of data

All codes and data referred to in this guide can be found at

Contents

Introduction

Work with data files: array and cell structures

Feature characteristics

Within-group distributions

Mean (the average) and median (the middle of a pre-sorted list)

Range and standard deviation

Analysis of 2D distributions

Scatterplot; plot and subplot

Regression

Validating results with Bootstrapping

Principal Component Analysis (PCA)

Introduction

How to organise the working space is up to the user. A recommended option:

- a directory with user-made MatLab codes, say Codes, and two or more subdirectories, Data and Results, in which data and results are stored, respectively.

MatLab can then be brought to the working directory with traditional MS-DOS or UNIX commands such as cd. MatLab then remembers this path.

MatLab is organised as a set of packages, each in its own directory, consisting of program files with the extension .m. A few data handling programs used in this guide are in the DIY06 directory.

"Help" command allows seeing names of the packages as well as of individual program files; the latter are operations that can be executed within MatLab. Example: Command “help” shows a bunch of packages, “matlab\datafun” among them; command “help datafun” displays a number of operations such as “max – taking largest component (column-wise)”; command “help max” explains the operation in detail.

Work with data files: array and cell

A data file should be organised as an entity-to-feature data table: rows correspond to entities, columns to features (see stud.dat and stud.var). Such a data structure is referred to as a 2D array or matrix; 1d arrays correspond to solitary entities or features. This is one of MatLab data formats. The array format works on the principle of a chess-board: its (i,k)-th element is the element in i-th row k-th column (A convention: row always goes first!). An array's defining property is that every row has the same number of columns.
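A tiny sketch of the convention (the array A and its values here are made up for illustration):

>> A=[1 2 3; 4 5 6];  % a 2 x 3 array: 2 rows, 3 columns
>> A(2,3)             % returns 6, the element in row 2, column 3 (row first!)
>> A(1,:)             % the whole first row: 1 2 3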

To load such a file one may use a command from package "iofun". A simple one is "load":

>> a=load('Data\stud.dat');

% symbol "%" is used for comments: MatLab interpreter doesn’t read lines beginning with “%”.

% "a" is a place to put the data (variable); ";" should stand at the end of an instruction;

% stud.dat is a 100 x 8 file of 100 part-time students with 8 features:

% 3 for Occupation (IT,BA, AN), Age, NChildren, and scores over three disciplines (in file stud.var)

Names are handled as strings, delimited by the ' symbol (no spaces are permitted within such a string). Entity/feature names may vary in length and thus cannot be handled in the array format.

To do this, another data format is used: the cell. Round braces (parentheses) are used for arrays, curly braces for cells: a(i,:) - array's a i-th row, b{i} -cell's b i-th element, which can be a string, a number, an array, or a cell, which allows quite complex data structures. There can be other data structures as well (video, audio,...).
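A small sketch of the cell format (the cell nm and its contents are just an illustration):

>> nm={'Age','OOProgramming_Score'};  % a cell of two strings of different lengths
>> nm{2}                              % returns the string 'OOProgramming_Score'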

>> b=readlist('Data\stud.var'); % list of names of stud.dat features

Note: "load" is a built-in MatLab command, whereas "readlist" is not: it is an .m file, a piece of MatLab code that is not part of MatLab proper.

Subsets of data

If one wants to work with only three of the six features, say "Age", "Children" and "OOProgramming_Score", one must put their indices together into a named 1d array:

>> ii=[4 5 7]

% no semicolon at the end, so that ii is displayed on screen as a row; to make ii a column, semicolons are used between the entries
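For instance (the name iicol is made up):

>> iicol=[4;5;7]  % the same indices arranged as a 3 x 1 column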

>> newa=a(:,ii); %new data array

>> newb=b(ii); %new feature set

A similar command produces a subset of entities. If, for instance, we want to limit our attention to only those students who received 60 or more at "OOProgramming", we first find their indices with the command "find":

>> jj=find(a(:,7)>=60);

% jj is the set of indices of the students satisfying the condition inside find()

% a(:,7) is the seventh column of a

Now we can apply "a" to "ii":

>> al=a(jj,:); % partial data of the better-performing students
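The row and column selections can be combined in one step (a sketch; the name newal is made up, with jj and ii as defined above):

>> newal=a(jj,ii); % better-performing students over the three chosen features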

Saving

To save a data array obtained within MatLab into a file in the Data directory, savemat.m from the DIY06 directory can be applied:

>> savemat('newstud',al); % the file is saved as newstud.dat

There is also MatLab's own .mat format for storing variables in files; it can be used within MatLab only (see help save).
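A minimal sketch of that route (the file name and location are arbitrary):

>> save('Results\al.mat','al');  % stores the variable al in MatLab's binary format
>> load('Results\al.mat');       % brings al back into the workspace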

Categorical features

Features can be categorical or quantitative. A quantitative feature can be categorised. Take, for example, the "OOProgramming" score and postulate that its range should be divided into 4 categories:

1 - those whose mark is less than 40;

2 - the mark is between 40 and 50 inclusive;

3 - those whose mark is between 51 and 74 inclusive, and

4 - those with mark 75 or greater

The thus categorised variable is to be put as an additional, 9th, column in "a" (see the sketch below). This can be done by using an array of all boundary points:

>> bound=[0 39 50 75 101];

% note that the minimum and maximum marks are also included

followed by the instruction:

>> for k=1:4;ll{k}=find(a(:,7)>bound(k) & a(:,7)<=bound(k+1));end

% ll{k} holds the indices of the students falling into the k-th category

Built-in histogram commands can be used as well:

>> hh=hist(a(:,7),4);

% produces counts of feature in 7th column in 4 equally sized bins

>> hc=histc(a(:,7),bound(1:4));

% produces counts within categories delineated in array "bound" defined above
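To actually attach the categorised variable as the 9th column of "a", as mentioned above, the cell ll produced by the loop can be used (a sketch):

>> for k=1:4; a(ll{k},9)=k; end  % each student gets her/his category number in column 9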

Within-group distributions

[pic]

Figure 1: Distribution of students over four score groups within each occupation category.

Let us visualise distributions of SoftEngineering score within student occupation groups, IT, BA and AN, that are coded by 1 in columns 1, 2 and 3, respectively, of the data array "a".

>> clear hh % remove the hh computed earlier, which has a different shape

>> for ii=1:3

list=find(a(:,ii)==1);

hh(:,ii)=histc(a(list,4),bound);

end

>> hh=hh(1:4,:); %Removal of the meaningless line corresponding to bound(5)=101

Command

>> plot(1:4,hh);

produces a figure looking, after editing, as Figure 1, whereas command

>> bar(hh);

produces a figure looking, after editing, as Figure 2

[pic]

Figure 2: Different distributions of score groups within each occupation category
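Part of the editing mentioned above can be done from the command line; a minimal sketch (the label texts are assumptions):

>> legend('IT','BA','AN');
>> xlabel('Score category'); ylabel('Number of students');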

Mean (the average) and Median (the middle of pre-sorted list)

Mean scores of all columns:

>> aver=mean(a)

% Means in columns 1, 2 and 3 express proportions of the categories IT, BA, and AN in the sample

Command

>> mmed=median(a);

produces the median values in columns (note: after sorting!)

Range and Standard deviation

>> bmin=min(a); % within-column minimum values

>> bmax=max(a); % within-column maximum values

>> range=bmax-bmin; % within-column ranges

The standard deviation is the square root of the variance, which itself is the average squared deviation from the mean:

>> sdev=std(a);

% or sdev=std(a,1) if you want N, not N-1, in the denominator of the variance
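A one-line check of the definition on a single column, say Age (a sketch):

>> v=a(:,4); sqrt(mean((v-mean(v)).^2))  % coincides with std(v,1)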

Analysis of 2D distributions

Bivariate distributions show the relationship between two features. Different cases occur depending on the types of scale: (a) both features are quantitative, (b) both are categorical, (c) one is categorical and the other quantitative; only case (a) is presented here.

Scatterplot; plot and subplot

Given a data table, like "a" the student-to-feature 100 x 8 array, take any two features of interest and plot entities as points on the plane formed by the features. For instance take "Age" as x and "Computational Intelligence" score as y:

>> x=a(:,4); % Age is 4-th column of array "a"

>> y=a(:,8); % CI score is in 8-th column of "a"

Then student 1 (the first row) will be represented by the point with coordinates x=28 and y=90, her/his age and CI score, respectively.

To plot all, use command:

>> plot(x,y,'k.')

% k refers to black colour, “.” dot graphics; 'mp' stands for magenta pentagram; see others by using % "help plot"

Unfortunately, this gives a very tight picture: some points lie on the borders. To stretch the borders out, one changes the axes:

>> d=axis;

>> axis(1.2*d-10);

This transformation is presented in Figure 3 on the right. To make both plots presented on the same figure, use "subplot" command:

>> subplot(1,2,1)

>> plot(x,y,'k.');

>> subplot(1,2,2)

>> plot(x,y,'k.');

>> d=axis;

>> axis(1.2*d-10);

Command subplot(1,2,1) creates one row consisting of two windows for plots and puts the follow-up plot into the 1st window (that on the left).

Whichever presentation is taken, no relation between the features can be seen at all. Let's try to see whether anything better can be found at different occupations. To visualise this, one needs to handle entity sets for different conditions separately:

[pic]

Figure 3: Scatter plot of features “Age” and “CI score”; the display on the right is a rescaled version of that on the left.

>> o1=find(a(:,1)==1); % set of indices for students in IT

>> o2=find(a(:,2)==1); % set of indices for students in BA

>> o3=find(a(:,3)==1); % set of indices for students in AN

>> x1=x(o1);y1=y(o1); % the features x and y at IT students

>> x2=x(o2);y2=y(o2); % the features at BA students

>> x3=x(o3);y3=y(o3); % the features at AN students

Now we are in a position to put, first, all the three together, and then each of these three separately (again with the command "subplot", but this time with four windows, see Figure 4).

>> subplot(2,2,1);

>> plot(x1,y1, '*b',x2,y2,'pm',x3,y3,'.k');% all the three plotted

>> d=axis;

>> axis(1.2*d-10);

>> subplot(2,2,2);

>> plot(x1,y1, '*b'); % IT plotted with blue stars

>> d=axis;

>> axis(1.2*d-10);

>> subplot(2,2,3);

>> plot(x2,y2,'pm'); % BA plotted with magenta pentagrams

>> d=axis;

>> axis(1.2*d-10);

>> subplot(2,2,4);

>> plot(x3,y3,'.k'); % AN plotted with black dots

>> d=axis;

>> axis(1.2*d-10);

Of the three occupation groups, I can see some potential in AN group only: it is likely that "the greater the age the lower the mark" in this group (black dots in the Figure 4’s bottom right). To check this, let us utilise the linear regression.

[pic]

Figure 4: Joint and individual displays of the scatter-plots for the occupation categories (IT blue, BA magenta, AN black).

Regression

Regression is a technique invented by F. Galton and K. Pearson of UCL to see whether the relationship between x and y can be expressed as a "linear function" (a straight line on the plot), y = slope*x + intercept, where slope and intercept are constants, the former expressing the change in y when x is increased by 1 and the latter the level of y at x=0. It can be proven that the best possible values of slope and intercept (that is, those minimising the average squared difference between the real y's and those found from the expression slope*x + intercept) are as follows:

>> slope = rho*std(y)/std(x); intercept = mean(y) - slope*mean(x);

Here "rho" is the so-called (Pearson) correlation coefficient between x and y that can be determined with MatLab operation "corrcoef". Since we are interested in group AN only, we take AN-related feature values only, that is, arrays x3 and y3:

>> cc=corrcoef(x3,y3)

leading to table

cc = 1.0000 -0.7082

-0.7082 1.0000

in which either off-diagonal entry is the rho; it can be picked out with

>> rho=cc(1,2);

Then the general formula applies to the pair (x3,y3):

>> slope = rho*std(y3)/std(x3); % this produces slope =-1.33;

>> intercept = mean(y3) - slope*mean(x3); % this produces intercept = 98.2;

thus leading to the linear regression y3= 98.2 - 1.33*x3 . Interpretation: every year added to the age, in general, decreases the mark by 1.33, so that aging by 3 years would lead to the loss of 4 marks (plus/minus possible hair loss).

To check whether the equation is indeed good, one may compare the real values for students number 81, 82, and 83, all belonging to AN, with those derived using the equation:

>> ii=[80 81 82];

>> x(ii); %the ages

>> y(ii); %the marks in data

>> yy=slope*x(ii)+intercept; % the marks derived from the age

thus yielding the following data:

x y yy

24 62 66.3

34 30 53.0

41 39 43.7

One can see that the error for the second student, 53 (predicted) – 30 (real) = 23, is rather high, which reflects the fact that the mark of this student contradicts the general regularity: s/he is younger than the third student but her/his mark is lower. (The comment above can be considered an example in interpreting bad results.)

Altogether, the regression equation explains rho^2=0.50=50% of the total variance of y3, which is typical in social sciences. (Index rho^2 is referred to as determination coefficient in statistics.)

Validating results with Bootstrapping

Bootstrapping is a popular computational experiment technique to validate data analysis results, especially in a situation in which there are not many observations and the mechanism of the phenomenon under study is unknown. The computational power allows for experimentation on the spot, with real data, rather than with some theoretical probabilistic distributions as in the conventional approach of mathematical statistics. This can be convenient in a situation in which the theoretical distribution is not necessarily adequate to the data.

Bootstrapping is based on a pre-specified number of random trials. In the case of the data of a hundred students, each trial begins with randomly selecting a student a hundred times, with replacement, so that the same entity can be selected several times whereas some other entities may never be selected in a trial. (In fact, in each trial, on average only about 63% of the entities get selected into the sample.) The selected set of a hundred students (some of them, as explained, coinciding) is assigned the corresponding data values from the original data table, so that coinciding students get identical feature values. Then the data analysis method under consideration, here the linear regression, is applied to these data to produce the trial result. After a number of such trials, the user has enough data to see how well the trial results correspond to the original ones, especially with respect to their standard deviation from them.

To do a trial randomly selecting 31 students from the AN occupation category, with replacement, one can use the following MatLab command:

>> ra=ceil(31*rand(31,1));

% rand(31,1) produces a column of 31 uniformly random real numbers, between 0 and 1 each.

% Multiplying this by 31 stretches the numbers to be between 0 and 31, and "ceil" rounds them up to integers.

% If the sampling is done without replacement, a simpler command can be used: ra=randperm(31);

Then corresponding values of x and y can be found by using equations:

>>xr=x3(ra);yr=y3(ra);

after which formulas above apply to compute the rho, slope and intercept.

To do this a number of times, one runs a loop:

>> for k=1:5000; ra=ceil(31*rand(31,1));

xr=x3(ra);yr=y3(ra);

cc=corrcoef(xr,yr);rhr(k)=cc(1,2);

sl(k)=rhr(k)*std(yr)/std(xr);

in(k)=mean(yr)-sl(k)*mean(xr);

end

% a "for" loop is done with the variable k running through every single integer

% from 1 to 5000 inclusive; the results are stored in the 5000-element arrays

% rhr (correlation coefficients), sl (slopes) and in (intercepts)

Now we can check the mean and standard deviation of the obtained distributions. Commands

>>mean(sl)

>>std(sl)

produce the values -1.33 and 0.24, which fit the previous result well and supplement the slope estimate with its standard deviation.

Similarly, mean and std values for the intercept and rho are produced. They are, respectively, 98.2 / 9.0 and -0.704 / 0.095.
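The commands are analogous (a sketch):

>> mean(in), std(in)    % intercept: about 98.2 and 9.0
>> mean(rhr), std(rhr)  % rho: about -0.704 and 0.095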

Conventionally, the bootstrap is considered complete when the regression parameters' estimates have been supplied with their standard deviations.

However, I cannot help making the picture even more impressive and presenting the found distributions of the parameters in Figure 5, the result of these commands:

>> subplot(1,2,1)

>> hist(sl,30)

>> subplot(1,2,2)

>> hist(in,30)

[pic]

Figure 5: 30-bin histograms of the slope (left) and intercept (right) after 5000 bootstrapping trials.

Further commands:

>>slh=hist(sl,30)

>>slf=find(slh>=70)

>>sum(slh(slf))

show that 4736 out of 5000 trials fall into histogram bins from 7 to 23. To determine the borders of this area, one finds

>>slbinsize=(max(sl)-min(sl))/30;

>>slleftbound= min(sl)+6*slbinsize

>>slrightbound=max(sl)-7*slbinsize

which produces -1.80 and -0.86 as the left and right boundaries for the slope; these hold for 4736/5000 ≈ 94.7% of the trials.

Similar computations, with

>> inh=hist(in,30);

>> inff=find(inh>60);

>>sum(inh(inff))

will find the left and right boundaries for the intercept that hold for 95.1% of the trials (leaving out 8 bins on the left and 5 bins on the right): 81.7 to 117.4.
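A sketch of the analogous boundary computation for the intercept (the 8 and 5 left-out bins are taken from the text above):

>> inbinsize=(max(in)-min(in))/30;
>> inleftbound=min(in)+8*inbinsize
>> inrightbound=max(in)-5*inbinsize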

This all can be put into the same Figure 6 by, first, defining the three regression lines with

>> y3reg=slope*x3+intercept;

>> y3regleft=slleftbound*x3+inleftbound;

>> y3regright=slrightbound*x3+inrightbound;

and plotting then the four sets onto the same figure:

>> plot(x3,y3,'*k',x3,y3reg,'k',x3,y3regleft,'r',x3,y3regright,'r')

% x3,y3,'*k' presents student data as black stars

% x3,y3reg,'k' presents the real regression line in black

% x3,y3regleft,'r' and x3,y3regright,'r' present the boundary regressions

% with red lines

The red lines on Figure 6 show the limits of the regression line for 95% of trials.

[pic]

Figure 6: Regression of CI score over Age (black line) within occupation category AN with boundaries covering 95% of potential biases due to sample fluctuations.

Principal Component Analysis

This is a most popular method for the aggregation and visualisation of data, also due to the efforts of F. Galton and K. Pearson in their quest for "talent inherited". It can be introduced, in the framework of the stud.dat data with the feature space reduced to ii=[4 6 7 8], that is, Age and the SP, OO and CI marks (to make things easier to see on the screen), as follows.

Assume that there is a hidden feature z, assigning a value z_i to each student i=1,…,100, that alone, along with feature "loadings" c=(c_Age, c_SP, c_OO, c_CI), can explain all the 100x4=400 entries of the array X=xm=a(:,ii), obtained from stud.dat at the 4 features in ii, so that each entry can be represented, approximately, as the product of the corresponding z-value and a feature-specific loading:

x_iv ≈ z_i c_v   (i=1,…,100; v=1,2,3,4)      (*)

The equations above model the 400 entries on the left with 100+4=104 hidden values, to be found by minimising the differences. However, taken as such, they are not mathematically treatable because of an innate ambiguity: if z and c give a solution, then z/3 and 3c provide an equally good solution, since z_i c_v = (z_i/3)(3c_v); any other real number could be used here instead of 3. The ambiguity is resolved if the vectors z and c are restricted to be normed (the sum of their squared entries equal to unity) and whatever quantity the product acquires is expressed as a separate factor μ:

x_iv ≈ μ z_i c_v   (i=1,…,100; v=1,2,3,4)      (1)

The scoring set (vector) z in (*) is referred to as the Principal component of matrix X.

After the hidden values μ, z_i, c_v in (1) are found, they can be used to recover the original model (*) values by multiplying both z and c by the square root of μ.

The model in (1) is fitted by minimising the least-squares criterion L2, equal to the sum of the squared differences, with the so-called maximum singular value μ* and the corresponding singular vectors z* and c* of the data matrix X=(x_iv), which satisfy the following properties:

1. The weighted sum of the columns of matrix X, with weights given by the components of c*, equals μ*z*; similarly, the weighted sum of the rows of X, with weights given by the components of z*, equals μ*c*. In matrix terms,

XT z* = μ* c*   and   X c* = μ* z*      (2)

where XT denotes the transpose of matrix X.

2. The principal component z* accounts for the share μ*^2 of the data scatter, whereas the least-squares criterion value L2 is the complementary share:

T(X) = μ*^2 + L2      (3)

The data scatter T(X) is defined as the sum of all the squared entries x_iv^2 of X.

3. The line through 0 and c* minimises the sum of the squared distances between the entity points and their projections onto the line in the feature space; the line's direction is that of the longest axis of the ellipsoid which is the image of the unit sphere in the entity space under the mapping defined by data matrix X (see Figure 7).

[pic]

Figure 7. Sphere of unit radius in the entity space (a) and its image, the ellipsoid c=XTz, in the feature space (b). The first component, c*, corresponds to the maximal axis of the ellipsoid, with its length equal to 2μ*.

4. The c* direction can also be defined as the eigenvector of the 4x4 matrix XTX corresponding to its maximum eigenvalue, which equals μ*^2; the matrix XTX is proportional to the feature covariance matrix if X has been pre-centred by subtracting the within-column means.

5. These properties extend to all subspaces generated by the first few singular vectors: the first two make the plane best representing the data, the first three make the best-fitting 3D space, and so on.

The properties imply, among other things, that the principal components do depend on individual feature scales as well as their origin points. It is advisable, thus, to pre-process data by shifting the origin to the mean variable values and by rescaling features according to their ranges.

Specifically, for the case of X=xm, the feature means can be computed with

>> av=mean(xm)

av = 33.6800 58.3900 61.6500 59.8700

and range with

>> ra=max(xm)-min(xm)

ra = 31 56 67 70

To normalise xm with these, we need to repeat them as many times as xm has rows:

>> avm=repmat(av,100,1);

>> ram=repmat(ra,100,1);

Then the standardisation is achieved with

>> xms=(xm-avm)./ram;

Note that the subtraction -, the division ./, and the multiplication .* apply entry-wise, that is, separately to each (i,v) entry of the corresponding matrices.
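A quick check that the standardisation worked (a sketch):

>> mean(xms)           % all (close to) zero: the columns are centred
>> max(xms)-min(xms)   % all ones: each column has been scaled by its range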

The matrix XTX mentioned in property 4 above can be expressed as follows:

>> xtx=xms'*xms

xtx =  7.5107   1.2987  -2.4989  -2.8167

       1.2987   6.0918   0.4335  -0.1543

      -2.4989   0.4335   6.0207   1.6660

      -2.8167  -0.1543   1.6660   4.7729

As xtx is proportional to the feature covariance matrix, one can already notice that the data are not quite straightforward: features 1 and 2, though positively related to each other, relate to feature 3 in different ways.

All four singular values and vectors can be found in Matlab with operation of Singular value decomposition (SVD):

>> [Z,S,C]=svd(xms);

Here Z is a 100x4 array whose columns are the singular z vectors, C is a 4x4 array whose columns are the singular c vectors, and the first four rows of array S form a 4x4 diagonal matrix whose diagonal entries are the corresponding singular values μ; these can be extracted with the command

>> mu=S(1:4,1:4);
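One can verify property (2) numerically (a sketch):

>> norm(xms'*Z(:,1)-mu(1,1)*C(:,1))  % close to zero, as XT z* = μ* c* requires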

Three subjects can be covered with these results:

- Visualisation

- Evaluation

- Interpretation

We consider them in turn.

To visualise the data onto a 2D plane, we need just two principal components, that can be defined by using the first and second singular values and corresponding z vectors:

>> x1=Z(:,1)*sqrt(mu(1,1));

>> x2=Z(:,2)*sqrt(mu(2,2));

These can be seen as a 2D plot, on which groups of entities falling in categories such as Occupation:AN (entities from 70 to 100) and Occupation:IT (entities 1 to 35) can be highlighted:

>> subplot(1,2,1), plot(x1,x2,'k.');%Fig. 8, picture on the left

>> subplot(1,2,2), plot(x1,x2,'k.', x1(1:35),x2(1:35),'b^', x1(70:100),x2(70:100),'ro');

[pic]

Figure 8. Scatter plot of the 4D student data (Age, SP, OO and CI marks) row points on the plane of the two first principal components, after the data have been centred and rescaled into array xms. Curiously, students of occupations AN (circled) and IT (triangled) occupy contiguous regions, top and left respectively, of the plane, as can be seen on the right-hand picture.

To evaluate how well the data are approximated by the PC plane, according to equation (3) one needs to assess the summary contribution of the first two singular values squared in the total data scatter. To get the squares one can multiply matrix mu by itself and then see the proportion of the first two values in the total:

>> la=mu*mu

la = 11.1889        0        0        0

           0   6.4820        0        0

           0        0   3.8269        0

           0        0        0   2.8982

>> 100*[la(1,1)+la(2,2)]/sum(sum(la))

ans = 72.43

This shows the PC plane takes 72.43% of the data scatter, which is not that bad for this type of data.

To interpret the results, one uses the loading coefficients c in expression (2), which the svd command returns as the columns of matrix C. The first two columns, in transposed form, are:

>> interp=C(:,1:2)'

interp = 0.7316 0.1587 -0.4858 -0.4511 %first singular vector

-0.1405 -0.8933 -0.4165 -0.0937 %second singular vector

The coefficients straightforwardly show how much a principal component is affected by a feature.

The first component is positively related to Age and negatively to the OO and CI marks: on average, it increases as Age increases and the OO/CI marks decrease. The second component increases when the SP and OO marks decrease (this obviously can be reverted by swapping all minuses for pluses). Thus, the first component can be interpreted as "age-related Computer Science deterrence" and the second as "dislike of programming issues". The triangle and circle patterns on the right of Figure 8 then show that IT students are on the minimum side of the age-related CS deterrence, whereas AN occupations are high on the second component scale.
