Chapter 1 Basic notions in classical data analysis



1.1 Data and Data presentation

The goal of data analysis: discover relations and features in a dataset.

Geophysical data are those that record a measurement and also the location for the measurement. Geophysical data analysis can be understood as using such data to detect, examine, understand or predict events and phenomena that are of relevance to a geophysical study. However, the methods of analysis may not themselves be geophysical. Many of the more common descriptive, inferential and relational statistics make no specific allowance for where the data were collected.

Types of statistics: There are three types of statistics: those that describe and summarise a set of data in its own right (descriptive statistics); those that 'go beyond' the data to infer something more general about the population from which the data were sampled (inferential statistics); and those that examine relationships (relational statistics).

Data are measurements of something of interest. They are also called observations because the measurements help us to observe (and to quantify) an attribute of whatever is being studied.

Data type -- Discrete data are those that take one of a restricted set of possible values. They are usually whole-number (integer) data. Continuous data can take any value, or any value between a lower and an upper limit, to a certain level of precision. They are ‘real’ or ‘floating-point’ numbers.

Some ways to present data --- Table, Bar Chart, Histogram, Line Graph, etc.

Bar graph - A bar graph is a way of summarizing a set of categorical data. It displays the data using a number of rectangles, of the same width, each of which represents a particular category. Bar graphs can be displayed horizontally or vertically and they are usually drawn with a gap between the bars (rectangles).

Histogram - A histogram is a way of summarizing data that are measured on an interval scale (either discrete or continuous). It is often used in exploratory data analysis to illustrate the features of the distribution of the data in a convenient form.

Pie chart - A pie chart is used to display a set of categorical data. It is a circle, which is divided into segments. Each segment represents a particular category. The area of each segment is proportional to the number of cases in that category.

Line graph - A line graph is particularly useful when we want to show the trend of a variable over time. Time is displayed on the horizontal axis (x-axis) and the variable is displayed on the vertical axis (y-axis).

Scatter plot - A scatter plot shows how one variable changes in relation to another. The relationship between two variables is called their correlation.
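As a rough illustration of what a histogram summarises, the sketch below counts how many values fall into equal-width bins and prints a text version of the bars. The readings, the bin limits and the function name histogram_counts are hypothetical, chosen only for this example.

```python
def histogram_counts(values, lower, upper, n_bins):
    """Count how many values fall into each of n_bins equal-width bins on [lower, upper]."""
    width = (upper - lower) / n_bins
    counts = [0] * n_bins
    for v in values:
        if lower <= v <= upper:
            # A value exactly on the upper limit is placed in the last bin.
            index = min(int((v - lower) / width), n_bins - 1)
            counts[index] += 1
    return counts


# Hypothetical temperature readings (degrees Celsius), for illustration only.
readings = [11.2, 12.5, 12.9, 13.1, 13.4, 13.8, 14.0, 14.6, 15.3, 17.9]
counts = histogram_counts(readings, 10.0, 18.0, 4)

# Text version of the histogram: one row per bin, one '#' per observation.
for i, c in enumerate(counts):
    left = 10.0 + 2.0 * i
    print(f"{left:4.1f}-{left + 2.0:4.1f} | " + "#" * c)
```

Changing the number of bins changes the apparent shape of the distribution, which is why histograms are usually treated as an exploratory tool.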

[pic][pic]

[pic]

[pic]

[pic]

1.2 Measures of central tendency

Measures of central tendency and dispersion are common descriptive measures for summarising numerical data.

Simple definition:

Measures of central tendency are measures of the location of the middle or the center of a distribution.

The most frequently used measures of central tendency are the mean, median and mode.

• The mean is obtained by summing the values of all the observations and dividing by the number of observations.

[pic]

• The median (also referred to as the 50th percentile) is the middle value in a sample of ordered values. Half the values are above the median and half are below the median.

• The mode is the value that occurs most frequently. It is rarely of any practical use for numerical data.

A comparison of the mean, median and mode can reveal information about skewness, as illustrated in the figure below. The mean, median and mode are similar when the distribution is symmetrical. When the distribution is skewed, the median is more appropriate as a measure of central tendency because it is less affected by extreme values.
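The three measures can be computed with Python's standard statistics module. This is a minimal sketch; the rainfall values are hypothetical and were chosen so that one large observation pulls the mean away from the median.

```python
import statistics

# Hypothetical sample: daily rainfall totals in mm, for illustration only.
sample = [0, 0, 1, 2, 2, 2, 3, 5, 8, 27]

mean_value = statistics.mean(sample)      # sum of the values divided by their number
median_value = statistics.median(sample)  # middle of the ordered values
mode_value = statistics.mode(sample)      # most frequently occurring value

print("mean:", mean_value)
print("median:", median_value)
print("mode:", mode_value)
```

In this sample the mean (5) is well above the median (2) because of the single value of 27, which is the kind of difference that signals a skewed distribution.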

[pic]

Statistical definition:

Continuous variables

For a random variable $x$ that takes on continuous values over a domain $\Omega$, the expectation is given by an integral,

$$E[x] = \int_{\Omega} x\, p(x)\, dx$$ (1.1)

where $p(x)$ is the probability density function. For any function $f(x)$, the expectation is

$$E[f(x)] = \int_{\Omega} f(x)\, p(x)\, dx$$ (1.2)

Example: For a normal distribution: [pic] ~ N(0,1), the expectation of [pic]

[pic]
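The integrals in (1.1) and (1.2) can be approximated numerically. The sketch below does so for a standard normal variable using the midpoint rule. The choice f(x) = x^2, the truncated integration range [-10, 10] and the function names are assumptions made for illustration, not taken from the notes.

```python
import math


def normal_pdf(x):
    """Probability density function of the standard normal distribution N(0, 1)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def expectation(f, pdf, lower, upper, n_steps=20_000):
    """Approximate the integral of f(x) * pdf(x) over [lower, upper] by the midpoint rule."""
    dx = (upper - lower) / n_steps
    total = 0.0
    for k in range(n_steps):
        x = lower + (k + 0.5) * dx  # midpoint of the k-th sub-interval
        total += f(x) * pdf(x) * dx
    return total


# The density is negligible beyond |x| = 10, so the infinite domain is truncated there.
mean_x = expectation(lambda x: x, normal_pdf, -10.0, 10.0)       # approximates E[x]
mean_x2 = expectation(lambda x: x * x, normal_pdf, -10.0, 10.0)  # approximates E[x^2]

print("E[x]   is approximately", round(mean_x, 6))
print("E[x^2] is approximately", round(mean_x2, 6))
```

For N(0,1) the exact values are E[x] = 0 and E[x^2] = 1, so the numerical results should lie very close to those.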

Discrete variables

Let $x$ be a random variable that takes on discrete values. For example, $x$ can be the outcome of a die cast, where the possible values are $x_i = i$, with i = 1, 2, …, 6. The expectation or expected value of $x$ from a population is given by

$$E[x] = \sum_i x_i P_i$$ (1.3)

where $P_i$ is the probability of $x_i$ occurring. If the die is fair, $P_i$ is 1/6 for all i,

so $E[x]$ = (1+2+3+4+5+6)/6 = 3.5. We also write

$$E[x] = \mu_x$$

with $\mu_x$ denoting the mean of $x$ for the population.

Similarly, for any function $f(x)$ of a discrete variable, its expectation is

$$E[f(x)] = \sum_i f(x_i) P_i$$ (1.4)
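The die example can be reproduced exactly with rational arithmetic. The sketch below evaluates the discrete sums of (1.3) and (1.4) for a fair die. The helper name expectation and the choice f(x) = x^2 are illustrative assumptions.

```python
from fractions import Fraction

outcomes = [1, 2, 3, 4, 5, 6]
probabilities = [Fraction(1, 6)] * 6  # fair die: every face is equally likely


def expectation(values, probs, f=lambda v: v):
    """Sum f(x_i) * P_i over all outcomes; with the default f this is E[x]."""
    return sum(f(v) * p for v, p in zip(values, probs))


expected_x = expectation(outcomes, probabilities)                    # E[x]
expected_x2 = expectation(outcomes, probabilities, lambda v: v * v)  # E[x^2]

print("E[x]   =", expected_x, "=", float(expected_x))
print("E[x^2] =", expected_x2)
```

Using Fraction avoids rounding, so E[x] comes out as exactly 7/2, matching the value 3.5 in the text.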

The expectation of a sum of random variables satisfies

$$E[ax + by + c] = a\,E[x] + b\,E[y] + c$$ (1.5)

where $x$ and $y$ are random variables, and $a$, $b$ and $c$ are constants.
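Property (1.5) can be checked by enumerating every outcome of two dice. In this sketch the two fair dice and the constants a = 2, b = 3, c = 1 are assumptions chosen for illustration. The dice are taken as independent only to make the joint probabilities easy to write down; linearity itself does not require independence.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)
p_joint = Fraction(1, 36)  # two independent fair dice: each (x, y) pair is equally likely
a, b, c = 2, 3, 1          # arbitrary constants chosen for illustration

# Left-hand side of (1.5): expectation of a*x + b*y + c over all 36 joint outcomes.
lhs = sum((a * x + b * y + c) * p_joint for x, y in product(faces, faces))

# Right-hand side of (1.5): combine the individual expectations.
expected_x = sum(x * Fraction(1, 6) for x in faces)
expected_y = sum(y * Fraction(1, 6) for y in faces)
rhs = a * expected_x + b * expected_y + c

print("E[ax + by + c]    =", lhs)
print("aE[x] + bE[y] + c =", rhs)
```

Both sides evaluate to 37/2, i.e. 2(3.5) + 3(3.5) + 1 = 18.5.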

In practice, one can sample only N measurements of $x$ ($x_1, \ldots, x_N$) from the population. The sample mean $\bar{x}$ or $\langle x \rangle$ is calculated as

$$\bar{x} \equiv \langle x \rangle = \frac{1}{N}\sum_{i=1}^{N} x_i$$ (1.6)

which is in general different from the population mean $\mu_x$. As the sample size increases, the sample mean approaches the population mean.

In MATLAB, the mean is computed with mean(x).
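The convergence of the sample mean can be seen in a small simulation. This sketch draws rolls of a fair die (population mean 3.5) for increasing sample sizes. The seed and the sample sizes are arbitrary choices, and the exact numbers printed depend on the random number generator.

```python
import random

rng = random.Random(0)  # fixed seed so that the run is repeatable


def sample_mean_of_die(n):
    """Roll a fair six-sided die n times and return the sample mean, as in (1.6)."""
    rolls = [rng.randint(1, 6) for _ in range(n)]
    return sum(rolls) / n


sample_means = {n: sample_mean_of_die(n) for n in (10, 1_000, 100_000)}

for n, m in sample_means.items():
    print(f"N = {n:>7}: sample mean = {m:.4f}, difference from 3.5 = {m - 3.5:+.4f}")
```

The difference from 3.5 typically shrinks as N grows, although any single small sample can happen to land close to the population mean by chance.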

1.3 Variance and covariance

A measure of central tendency is useful but says nothing about how the data are distributed around it. Knowing only an average is not, by itself, especially helpful.

Example: Imagine measurements are taken of a pollutant in each of three streams on 10 different days. The first stream is found to have a mean pollution concentration of 6.1 units, the second stream has 9.1 units and the third has 5.6 units. If the threshold beyond which a stream becomes toxic is 10 units, then which is the stream to worry about?

Intuitively the answer is the second stream because its mean (of 9.1) is closest to 10. But that intuition could be misleading without knowledge of how the individual measurements vary around it, as shown below.

[pic]

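In general, a measure of central tendency should always be accompanied by a measure of spread, i.e., of the fluctuation about the mean value. Spread is commonly characterized by the variance of the population,

$$\mathrm{var}(x) \equiv E[(x-\mu_x)^2] = E[x^2 - 2x\mu_x + \mu_x^2] = E[x^2] - \mu_x^2$$ (1.7)

where (1.5) has been invoked. The standard deviation $s$ is the positive square root of the population variance, i.e.,

$$s = \sqrt{\mathrm{var}(x)}$$ (1.8)

The sample standard deviation $\sigma$ is the positive square root of the sample variance,

$$\sigma^2 = \frac{1}{N-1}\sum_{i=1}^{N}(x_i-\bar{x})^2$$ (1.9)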

As the sample size increases, the sample variance approaches the population variance. For large N, distinction is often not made between having N-1 or N in the denominator of (1.9).
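The effect of the denominator can be seen by computing the variance both ways. This is a sketch under stated assumptions: the data values and the function name variance are hypothetical, and the parameter ddof (the number subtracted from N in the denominator) is a naming choice for this example.

```python
import statistics


def variance(values, ddof=1):
    """Mean squared deviation from the sample mean, with N - ddof in the denominator."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - ddof)


# Hypothetical measurements, for illustration only.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

var_n_minus_1 = variance(data, ddof=1)  # N - 1 in the denominator, as in (1.9)
var_n = variance(data, ddof=0)          # N in the denominator

print("variance with N - 1:", var_n_minus_1)
print("variance with N:    ", var_n)
print("standard library:   ", statistics.variance(data), statistics.pvariance(data))
```

With only eight values the two results differ noticeably (32/7 against 4); the ratio between them is N/(N-1), which tends to 1 as N grows.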

Degrees of freedom

The degrees of freedom are the amount of flexibility you have to change the values of some observations if certain properties of the data set are fixed. Those properties are often referred to as parameters. Imagine somebody takes some playing cards and, having removed the picture cards, looks at 10 cards at the top of the pack. That person then tells you the sum of the 10 cards is sixty before dealing them face down. The question is: what is the maximum number of cards you need to turn up before you know the face value of them all? The answer is nine. You know there are 10 cards and you know the sum of their values. By deducting the values of the first 9 cards from sixty, you can determine the value of the final card.

Alternatively, consider a data set containing n = 10 observations for which the mean is known and must stay constant. You can take any nine of the observations and change their values to anything you like. But, having done so, you have no choice about what the tenth element equals: its value is fixed by the mean and by the other observations. Your ‘degrees of freedom’ are limited to n - 1 of the data values. Be aware that the number of degrees of freedom is not always n - 1. It depends on the parameters of the data set required for a particular test or measure. A formal definition of degrees of freedom is the sample size, n, minus the number of parameters, p, estimated from the data.
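The card example amounts to one line of arithmetic. In the sketch below the nine revealed card values are hypothetical; only the total of sixty and the count of ten cards come from the example.

```python
known_total = 60  # the sum of the ten cards, stated before they are dealt
n_cards = 10

# Hypothetical face values of the nine cards turned up so far.
revealed = [7, 3, 10, 5, 2, 9, 6, 4, 8]

# With the total fixed, the last card is not free to vary.
last_card = known_total - sum(revealed)
degrees_of_freedom = n_cards - 1  # one parameter (the total) is fixed

print("value of the tenth card:", last_card)
print("degrees of freedom:", degrees_of_freedom)
```

The same reasoning applies when the mean is fixed instead of the sum, since the mean is the sum divided by n.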

Normalization: The goal is to allow different variables to be measured and compared on the same scale, i.e., to remove the influence of units.

For example, we would like to compare two very different variables, e.g., sea surface temperature and fish population. Because their units differ, their variations cannot even be drawn meaningfully on the same axis. So, one usually standardizes the variables before making the comparison. The standardized variable

$$x_s = \frac{x - \bar{x}}{\sigma}$$ (1.10)

is obtained from the original variable by subtracting the sample mean and dividing by the sample standard deviation. The standardized variable is also called the normalized variable or the standardized anomaly (where anomaly means the deviation from the mean value).
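A minimal sketch of the standardization in (1.10) follows. The sea surface temperature and fish catch values are hypothetical, and the function name standardize is chosen for this example.

```python
import statistics


def standardize(values):
    """Subtract the sample mean and divide by the sample standard deviation (N - 1 form)."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [(v - mean) / std for v in values]


# Hypothetical yearly values, for illustration only.
sst = [26.1, 27.4, 28.0, 26.8, 27.2]  # sea surface temperature, degrees Celsius
fish = [1200, 950, 700, 1100, 1000]   # fish catch, number of individuals

sst_anomaly = standardize(sst)
fish_anomaly = standardize(fish)

for year, (s, f) in enumerate(zip(sst_anomaly, fish_anomaly), start=1):
    print(f"year {year}: SST anomaly = {s:+.2f}, fish anomaly = {f:+.2f}")
```

After standardization both series are dimensionless, with zero mean and unit standard deviation, so they can be drawn on the same axis.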

Covariance

Covariance measures how two variables move together. It measures whether the two move in the same direction (a positive covariance) or in opposite directions (a negative covariance). The variables can be any pair of quantities measured together.

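For two random variables $x$ and $y$, with means $\mu_x$ and $\mu_y$ respectively, their covariance is given by

$$\mathrm{cov}(x, y) = E[(x-\mu_x)(y-\mu_y)]$$ (1.11)

The variance is simply a special case of the covariance, with

$$\mathrm{var}(x) = \mathrm{cov}(x, x)$$ (1.12)

The sample covariance is computed as

$$\mathrm{cov}(x, y) = \frac{1}{N-1}\sum_{i=1}^{N}(x_i-\bar{x})(y_i-\bar{y})$$ (1.13)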

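In MATLAB, the variance, standard deviation and covariance are computed with var(x), std(x) and cov(x,y), respectively. For two vectors, cov(x,y) returns the 2-by-2 covariance matrix, whose off-diagonal element is the covariance of x and y.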
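The sign convention and the special case (1.12) can be illustrated with a small hand-written function. The three data series and the function name sample_covariance are hypothetical, chosen so that the results are easy to verify by hand.

```python
def sample_covariance(x, y):
    """Sample covariance with N - 1 in the denominator, as in (1.13)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


# Hypothetical series, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_up = [2.0, 4.0, 6.0, 8.0, 10.0]    # rises when x rises
y_down = [10.0, 8.0, 6.0, 4.0, 2.0]  # falls when x rises

cov_up = sample_covariance(x, y_up)
cov_down = sample_covariance(x, y_down)
var_x = sample_covariance(x, x)  # the variance as a special case of the covariance

print("cov(x, y_up)   =", cov_up)
print("cov(x, y_down) =", cov_down)
print("cov(x, x)      =", var_x)
```

The first covariance is positive, the second is negative, and cov(x, x) equals the sample variance of x.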

1.4 Skewness

Skewness is a measure of symmetry or more precisely, the lack of symmetry.

$$\mathrm{skew}(x) = \frac{E[(x-\mu_x)^3]}{s^3}$$ (1.14)

where $\mu_x$ and $s$ are the mean and standard deviation of $x$. As one might expect, the formula takes on a positive value if $x$ is positively skewed and a negative value if $x$ is negatively skewed.

A distribution, or data set, is symmetric if it looks the same to the left and right of the center point, i.e., skew = 0. If the left tail (the tail at the small end of the distribution) is more pronounced than the right tail (the tail at the large end of the distribution), the skew is negative. If the reverse is true, the skew is positive. Distributions with positive skew are sometimes called "skewed to the right" whereas distributions with negative skew are called "skewed to the left."

Positive vs. negative skewness (left panel: + skew; right panel: - skew)

[pic]

These graphs illustrate the notion of skewness. Both PDFs have the same expectation and variance. The one on the left is positively skewed. The one on the right is negatively skewed.
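The sign of the skewness can be checked on small data sets. The sketch below uses the ratio of the third central moment to the cube of the standard deviation, with N in the denominators. That normalization, the function name and the data are assumptions made for illustration.

```python
def sample_skewness(values):
    """Third central moment divided by the cube of the standard deviation (N denominators)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Hypothetical data sets, for illustration only.
right_tailed = [1, 2, 2, 3, 3, 3, 4, 10]         # long tail at the large end
left_tailed = [11 - v for v in right_tailed]     # mirror image: long tail at the small end
symmetric = [1, 2, 3, 4, 5]

skew_right = sample_skewness(right_tailed)
skew_left = sample_skewness(left_tailed)
skew_symmetric = sample_skewness(symmetric)

print("right-tailed data:", round(skew_right, 3))
print("left-tailed data: ", round(skew_left, 3))
print("symmetric data:   ", round(skew_symmetric, 3))
```

Mirroring the data flips the sign of the skewness without changing its magnitude, and the symmetric data set gives zero.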

1.5 Examples of using these concepts to analyse practical problems

Standard Deviation

[pic]

Covariance

For the data sets X = 65.21, 64.75, 65.26, 65.76, 65.96 and Y = 67.25, 66.39, 66.12, 65.70, 66.64, find the covariance to estimate the linear relationship between the two data sets X & Y.

Solution

Sum(X) = 65.21 + 64.75 + 65.26 + 65.76 + 65.96

= 326.94

μx = 326.94 / 5

≈ 65.39

Sum(Y) = 67.25 + 66.39 + 66.12 + 65.70 + 66.64 = 332.10

μy = 332.10 / 5

= 66.42

cov(X,Y) = SUM((xi - μx) * (yi - μy)) / (n - 1)

= ((65.21 - 65.39) * (67.25 - 66.42) + (64.75 - 65.39) * (66.39 - 66.42) + (65.26 - 65.39) * (66.12 - 66.42) + (65.76 - 65.39) * (65.70 - 66.42) + (65.96 - 65.39) * (66.64 - 66.42)) / 4

= -0.058
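The arithmetic of this worked example can be checked with a few lines of Python, using the X and Y values given in the problem statement. The variable names are chosen for this sketch.

```python
X = [65.21, 64.75, 65.26, 65.76, 65.96]
Y = [67.25, 66.39, 66.12, 65.70, 66.64]

n = len(X)
mean_x = sum(X) / n
mean_y = sum(Y) / n

# Sum the products of paired deviations, then divide by n - 1.
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y)) / (n - 1)

print("mean of X:", round(mean_x, 3))
print("mean of Y:", round(mean_y, 3))
print("cov(X, Y):", round(cov_xy, 3))
```

The small negative value indicates a weak tendency for Y to decrease when X increases over these five observations.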

[pic]

[pic]

[pic]

Shown above are two cases: El Nino and La Nina. Comparing the two cases reveals that (1) the largest variations occur in the eastern Pacific Ocean, and (2) El Nino and La Nina are asymmetric.

These two features can be explained by the figure below. As can be seen, the largest variances appear over the eastern Pacific, consistent with the figures above. Similarly, large positive skewness occupies the equatorial eastern boundary and small negative skewness appears in the west, indicating the asymmetry between El Nino and La Nina.

[pic]

Reference:

Harris, Richard; Jarvis, Claire. Statistics for Geography and Environmental Science (Kindle Location 1028). Taylor and Francis. Kindle Edition, chapter 2.
