


Rafal Ladysz: FINAL PROJECT PAPER

for INFS 795, Fall 2004 (Professor Carlotta Domeniconi)

CLUSTERING OF EVOLVING TIME SERIES DATA:

an attempt to optimize the number and initial positions of cluster centroids

for k-means clustering with regards to goodness and accuracy

using the Euclidean and Dynamic Time Warping distance measures

Content

INTRODUCTION 1

OBJECTIVES AND MOTIVATION 2

THEORETICAL BACKGROUND AND OTHER WORKS 2

CHARACTERISTICS OF K-MEANS AND EM ALGORITHMS 3

SIMILARITY MEASURES 3

EXPERIMENTAL SETUP 5

OBJECTS EXTRACTION 5

INITIALIZATION 6

ALGORITHMS 7

SOFTWARE TOOLS 8

DATA 9

EXPERIMENTAL RESULTS 10

EUCLIDEAN VS. DTW SIMILARITY MEASURES 10

ROLE OF WEIGHTING 10

DIMENSIONALITY 11

DATA BIAS 11

FINAL ACCURACY EVALUATION 11

SUMMARY 12

FUTURE WORKS 12

REFERENCES 13

APPENDIX 14

INTRODUCTION

Clustering is an unsupervised data mining technique in which the objects under investigation are grouped into subsets – disjoint or overlapping – of the original set of objects.

The importance of clustering as a data mining method stems from the fact that it does not (in general) assume any knowledge of the objects being clustered, and hence does not use the labels (classes) required by classification techniques (training-testing-evaluation). Since this scenario of no prior knowledge is very common in real life, clustering is the method of choice when one attempts to gather general knowledge about the scrutinized objects, usually with an interest in their similarities and differences.

Clustering can also be applied when knowledge about the objects exists but the goal is to derive new knowledge. This is often done in subspaces of the “original” domain space in which the objects are described (in their full dimensionality). For instance, a geographical region can be divided into a number of climatological zones; taking into account only a subset of the characterizing parameters (e.g. temperature and precipitation), a different partition, with a different number of clusters and different respective members, might turn out to be a “better” choice.

It also matters what kind of data is to be clustered. A time series is a widely used form of data, expressed as a sequence of real numbers in which each moment in the series' history is assigned a numeric value. Put another way, a time series is a discrete function of time, often characterizing changes of some observable value(s) over time (as the independent variable). A time series can be considered a special case of a data stream, where the latter can be multidimensional as opposed to the scalar (one-dimensional) character of a time series.

So-called event data must not be confused with time series data, which “is (...) a sequence of real numbers representing measurements of a variable taken at equal time intervals. Techniques developed for time series require numerical features, and do not support target events within a time frame” [1].

Because of the above, clustering time series – lying at the intersection of two widely visited areas – is an important data analysis activity which helps discover deeply hidden patterns, trends, features and indicators of the intrinsic dynamics of the underlying system (represented by the time series). The method is in some sense “similar” to correlation analysis. For instance – taking one more example from climatology – if a region can be divided into as many zones with respect to temperature and precipitation as with respect to temperature and pressure, this might suggest a higher correlation between pressure and precipitation (obviously, correlation does not imply causality).

In this project the focus was placed mainly on one clustering technique, k-means, while only a rudimentary comparison was made between it and Expectation-Maximization (EM). Two distance (similarity) measures have been applied: Euclidean and Dynamic Time Warping (DTW). Three groups of time series data have been used: climatological, medical and financial (each represented by two real data sets). Such data diversity should help obtain less biased results, as well as give some insight into these mutually complementary domains, representing micro- and macro-scale, nature- and culture-driven scenarios characterized by their respective time series.

OBJECTIVES AND MOTIVATION

The core objectives of this project are two-fold:

1. investigating the k-means clustering algorithm in terms of its sensitivity to initial conditions, i.e. the initial positions of the cluster centroids and the number of centroids,

2. comparing the Euclidean and DTW distance measures in terms of how they influence what is investigated in 1.

Because of limitations of time and of the software tools available during the project, the analogous research for EM clustering – initially included in the project scope – is considered possible future work.

The idea and motivation for this project originate from another one in which time series were investigated with respect to underlying chaos (see Appendix for explanation), using a dedicated software tool. A parameter called the largest Lyapunov exponent (LE, see Appendix again for explanation) was computed and all drastic changes were indexed, enabling the timing of those changes to be mapped back to the original time series (Figure 1, p. 4), thus generating a set of time series subsequences which can be considered as objects for clustering. A successful implementation of real-time clustering of such subsequences, as evolving time series data, could improve the accuracy of the mentioned LE-computing algorithm by increasing its sensitivity (but not its predictive capability) to the detected signal.

I believe this project is non-trivial in that it encounters two challenges: one is as old as partition-based clustering and regards optimization of the number of centroids, which is considered an NP-hard problem; the second stems from the experimental specificity in which the number of objects to be clustered (time series subsequences) is also unknown. These experimental constraints make the project interesting and challenging. It can be considered a first step towards gaining experience in the area, hence the project itself and its results should serve as a landmark and framework for further attempts in this direction.

THEORETICAL BACKGROUND AND OTHER WORKS

Clustering time series has a long experimental history. Most frequently, experimenters are interested in finding similar patterns among the objects, using either of two main approaches: whole clustering and subsequence clustering. In the first method, given a set of individual time series, the objective is to group similar time series into the same cluster, whereas in the second one, given a single time series, its subsequences are extracted by means of a “sliding window” [2].

Gunopulos et al. [3] use, besides the Euclidean similarity measure, a Longest Common Subsequence-like similarity (with global scaling).

Perng et al. [4] introduce a similarity definition congenial to human perception.

Clarkson [5] uses segmental k-means clustering with training. Segmental k-means operates with models, a number of samples per state and a number of states per model: it initializes N models, resegments the data and updates the cluster membership for each model, performs training to estimate a new model for each segment, and keeps resegmenting until convergence.

Vlachos et al. [6] suggest an approach based on multiple distance measures, including the Longest Common Subsequence and Dynamic Time Warping, arguing that the Euclidean distance measure is too vulnerable to noise.

Aggarwal et al. [7] describe an interesting approach to clustering evolving data streams (a generalized concept of scalar time series), introducing a stream clustering framework called CluStream, which consists of separate online and offline processing.

Daewon et al. [8] introduce the so-called Pseudo PCA for cluster analysis of time series, where dimensionality reduction is performed after applying so-called phase space reconstruction (see Appendix for explanation) to 1-dimensional scalar data.

Calinski et al. [9] were among those who attempted the fundamental issue of heuristics for the optimal number of clusters. Their pseudo-F statistic, also called the C-H index, is often used as the objective function for determining the optimal k. It can be described by the following formula:

CH(k) = (B(k)/(k-1)) / (W(k)/(n-k))

where B(k) is the between-cluster sum of squares, W(k) is the within-cluster sum of squares, n is the number of objects, k is the number of clusters, and CH(k) is to be maximized over k.
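As an illustration, below is a minimal NumPy sketch of the C-H index exactly as defined above; the function and variable names are mine and are not taken from any package used in the project.

import numpy as np

def ch_index(X, labels):
    """Calinski-Harabasz (pseudo-F) index: higher values indicate a better k-partition.
    X: (n, d) array of objects; labels: cluster assignment of each object."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n = len(X)
    clusters = np.unique(labels)
    k = len(clusters)
    overall_mean = X.mean(axis=0)
    # B(k): between-cluster sum of squares
    B = sum(np.sum(labels == c) * np.sum((X[labels == c].mean(axis=0) - overall_mean) ** 2)
            for c in clusters)
    # W(k): within-cluster sum of squares
    W = sum(np.sum((X[labels == c] - X[labels == c].mean(axis=0)) ** 2) for c in clusters)
    return (B / (k - 1)) / (W / (n - k))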

Legendre [10] uses the C-H index in his cascade-like strategy, where the user starts with k1 groups and ends with k2 groups (k1 > k2). When cascading from a larger to the next smaller number of groups, the two closest groups are identified and fused, followed by a two-step alternating least-squares algorithm run until convergence. The two steps are: (1) computing cluster centroids and using them as new cluster seeds and (2) assigning each object to its nearest seed. The author also considers weighting attributes.

It should be mentioned that there are many other methods for optimizing the number of clusters; for instance, Ebdon [11] suggests using the root mean square deviation (RMSD) to determine the optimum number of clusters, where the optimum solution usually has the smallest number of clusters that does not place “naturally” dissimilar points in the same partition. Higher values of the RMSD denote increased variation within the same clusters.

The approach taken in this project is yet different, facing the brutal reality of an unknown number of objects and (of course) partitions, while taking into account one more target constraint: the requirement of on-the-fly processing. Testing this requirement is out of the scope of this project and can be considered closely related possible future work.

CHARACTERISTICS OF K-MEANS AND EM ALGORITHMS

The k-means algorithm was designed by MacQueen in 1967 [12] and is probably the most popular and simplest clustering technique. It can be characterized briefly as follows:

– suboptimal and greedy (susceptible to local minima)

– sensitive to initial conditions and outliers; because of NP-hardness, initialization is typically random

– requires number of clusters (k) as part of the input

– Euclidean distance being its most natural similarity measure, allowing spherical zones only

– partitions are mutually disjoint (exclusive – no overlap allowed)

– guaranteed to terminate.

The k-means algorithm has several variants, e.g. k-modes (processes categorical data), k-medoids (operates with medoids, i.e. real objects, rather than means, and is thus more robust in the presence of outliers), and EM, which is another widely used partition-based clustering algorithm.

EM can be considered a generalization of k-means to a probabilistic setting: it maintains the probability of membership in all clusters rather than assigning each element to a single cluster. It works iteratively: initialize the means and covariance matrix; while the convergence criterion is not met, compute the probability of each data point belonging to each cluster and recompute the cluster distributions using the current membership probabilities (cluster probabilities are stored as instance weights, using the means and standard deviations of the attributes); terminate when the likelihood saturates. EM is useful when the ranges of values differ significantly across dimensions, but it is also sensitive to initial conditions.
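For reference, a minimal sketch of plain k-means (Lloyd's algorithm) with random initialization; this is a generic illustration of the behavior described above, not the exact implementation used in SPSS or Weka, and the parameter names are mine.

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: random initial centroids, Euclidean distance,
    alternate assignment and centroid update until convergence."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]        # random initialization
    for _ in range(n_iter):
        # assign every object to its nearest centroid (Euclidean distance)
        labels = np.argmin(((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1), axis=1)
        new_centroids = np.array([X[labels == c].mean(axis=0) if np.any(labels == c)
                                  else centroids[c] for c in range(k)])
        if np.allclose(new_centroids, centroids):                   # guaranteed to terminate
            break
        centroids = new_centroids
    return centroids, labels

Running this sketch with different seeds on the same data typically yields different partitions, which is precisely the sensitivity to initialization that the project addresses.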

SIMILARITY MEASURES

The Euclidean distance between two objects A and B in N-dimensional space, defined as

D_E = ( Σ_{i=1..N} (x_{A,i} − x_{B,i})^2 )^{1/2}

is one of the most commonly used distance measures, where each object, i.e. time series subsequence of length N, is considered as a point in N-dimensional space.

Advantages of the Euclidean distance measure include its simplicity and natural, intuitive sense. Disadvantages are: high sensitivity to noise and outliers (especially for sparse data), covering only a spherical domain space, and the demand for extensive data preprocessing when applied as a time series similarity measure (as was the case in this project). Clearly, in spite of its merits, the Euclidean distance is not the similarity measure of choice for time series.

In real-life scenarios small distortions very often take place, causing “delays” in time of some phenomena that are reactions to, or otherwise caused by, their direct or indirect stimuli. To account for such artifacts, so-called Dynamic Time Warping (DTW) was proposed by Sankoff and Kruskal in 1983. The DTW distance measure minimizes dissimilarity by aligning the two sequences, as in the figure below.

Note.

This technique emerged originally for speech recognition in the 1970s. In speech recognition the alignment is necessary because different parts of a word are spoken at different rates. The best alignment of the word to be recognized against a reference pattern therefore identifies the word as the one represented by that reference pattern.

[pic]

Fig. 2: Comparison of Euclidean (left) and Dynamic Time Warping (right) distance measures between two time series sequences (marked red and blue) in terms of relations between their respective points (one-to-one vs. many-to-many)

The DTW uses the following recurrence to compute the distance:

γ(i, j) = d(i, j) + min{ γ(i-1, j), γ(i-1, j-1), γ(i, j-1) }

where γ(i, j) is the cumulative distance, i.e. the sum of the local distance d(i, j) and the minimum cumulative distance among the adjacent cells. The idea is depicted in Figure 3 below:

[pic]

Fig. 3: Idea of the DTW distance measure - the general case of different lengths (in this project all lengths were equal within each experimental run); for the Euclidean distance the red “path” would be a straight diagonal (for equal lengths)

The similarity between the already aligned sequences becomes the conventional Euclidean distance between the two series – thus DTW can, in a sense, be considered a generalization of the Euclidean distance measure.

Advantages of DTW include optimality and a relatively wide area of application, beyond handwriting and speech recognition. Among the disadvantages are O(N2) time and space complexity and a limitation on dataset size (up to about 10,000). There are, however, methods to lower the time complexity of DTW: filling only part of the cost matrix, undersampling the data before time warping (used in this project), or reducing the number of times DTW runs by indexing.

EXPERIMENTAL SETUP

The experimental part concerning centroid optimization has been designed in the following structured way:

1. use all 6 datasets preprocessed as follows:

– extract subsequences (initial points using the LET program)

– index each subsequence for unique identification allowing for dimensions 30 and 100

– remove outliers (very few cases found, thus performed manually)

– mean-shift (subtracting subsequence-specific mean value from all points) [18]

– apply weights w = 1 and w = 1.05 geometrically for the first 20 points in each subsequence

– normalize both weighted (w = 1.05) and unweighted (w = 1) data (on a 0-9 scale)

– store each preprocessed dataset in its matrix

– format input files for the SPSS (and for EEG dataset also for Weka ARFF format)

2. for each dataset, precompute DTW distances pairwise, creating the distance matrices

3. run centroid optimization using two merge-or-not radius (R) values for

– Euclidean distance: with a built-in distance computing function

– DTW-distance: using precomputed distance matrix as input from (2)

4. record and store the resulting centroid positions (and, implicitly, their number k)

5. convert the centroid positions file into an SPSS-readable format (to utilize the “Centers initial file” option)

6. for all the above 6 datasets run k-means clustering using the following options

– without initial cluster positions, trying k < K

– as above, but with k > K

– with cluster positions from (4), i.e. using K

7. compare the results of the three options above, and in particular:

– compute the Calinski-Harabasz index (also called the pseudo-F statistic, based on ANOVA)

– perform visual accuracy evaluation (confusion matrix)

8. draw and report conclusions.

Comments:

- normalization: it usually matters for clustering whether the data are normalized (no dimension is favored if the scales are incomparable, e.g. age and salary) and how the data are scaled; in this case, however, all dimensions have been treated the same way and none was more important than another, except when a weight was applied after all other preprocessing steps; the 0-9 scale was forced by a limitation of the DTW software (the only one available at this time)

- the Calinski-Harabasz index can be interpreted as the goodness (rather than accuracy) of a particular k-partition.
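To make the preprocessing steps listed above concrete, here is a minimal sketch covering the mean-shift, the weighting of the first 20 points and the rescaling to the 0-9 range; the exact weighting scheme (a geometric progression of 1.05) and the per-subsequence rescaling are my interpretation of the description above, and the function name is mine.

import numpy as np

def preprocess(subsequences, w=1.05, n_weighted=20, lo=0.0, hi=9.0):
    """subsequences: (N, M) array, one row per extracted subsequence."""
    X = np.asarray(subsequences, dtype=float)
    X = X - X.mean(axis=1, keepdims=True)                        # mean-shift each subsequence
    weights = np.ones(X.shape[1])
    weights[:n_weighted] = w ** np.arange(1, n_weighted + 1)     # geometric weights on first points
    X = X * weights
    # rescale each subsequence into the 0-9 range required by the DTW program
    mn = X.min(axis=1, keepdims=True)
    mx = X.max(axis=1, keepdims=True)
    span = np.where(mx > mn, mx - mn, 1.0)                       # guard against flat subsequences
    return lo + (hi - lo) * (X - mn) / span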

OBJECTS EXTRACTION

Clustering evolving time series data is specific in that it is not known upfront how many objects – i.e. time series subsequences – are to be dealt with. Another challenge is how to extract the subsequences so that they can be compared against each other (to evaluate their mutual similarities) with minimal bias. The latter has been resolved by using a proprietary software tool called Lyapunov Exponent Tracker (LET), designed and implemented to compute the largest Lyapunov exponent (LE) of time series data as a discrete function of time.

To better understand the role of LET in this experiment and its background, it may be useful to look at its output (graphed using Excel):

[pic]

Fig. 1: The input data is visualized in the lower graph (EEG vs. time); the upper graph shows the values of LE vs. time, used as the criterion for generating the “cutting off” points for subsequences (when a sudden and substantial change of the LE occurs)

Extracting subsequences follows from evaluation of the LE: whenever its change is evaluated as “drastic” (i.e. exceeding a preset threshold), the underlying time index is mapped back to the original input data (EEG) to determine the beginning of a subsequence. As the subsequence lengths in all experiments are assumed to be equal (within any run for any time series), the starting point uniquely determines the whole subsequence as an object to be clustered. The number of such objects (subsequences) depends on the number of sample points in the input time series (from 6000 to 18000) and on the chosen subsequence length, which is also its dimensionality (30 or 100).
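A minimal sketch of this mapping step, assuming the LET output is available as a list of time indices at which the LE change exceeded the threshold (the LET tool itself is proprietary; only the mapping back to fixed-length subsequences is reproduced here):

import numpy as np

def extract_subsequences(series, cut_indices, length=30):
    """series: 1-D input time series; cut_indices: starting points flagged by LET.
    Returns an (N, length) matrix of equal-length subsequences (objects to cluster)."""
    series = np.asarray(series, dtype=float)
    subs = [series[i:i + length] for i in cut_indices if i + length <= len(series)]
    return np.vstack(subs)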

Comments on object extraction.

Obviously, such a method of extracting subsequences is somewhat “arbitrary”; nevertheless, the whole project originates from the idea of boosting the performance of the LET by means of on-the-fly time series clustering, so from that perspective this is the method of choice for the LET-related objective. There might be alternative, “more general” methods of extracting time series subsequences, though. Another possibility is collecting time series already as single subsequences. In this experiment the only method used was applying the LET.

Because of the above “connection”, the question of bias can be raised at this point. Without any rigorous statistical considerations, it can be said that the cut-off points are generated based on chaotic behavior, which takes place in a different space than the clustering. This circumstance, at least qualitatively, is a tentative argument for limited bias in extracting the time series subsequences.

INITIALIZATION

It is well known that the results of k-means (and of EM as well) depend strongly on the initial positions of the cluster centroids and on their number (the latter being user-defined and thus part of the input). The following algorithm was written and implemented to acquire, evaluate and index the cluster centroids before the real clustering began (see also Fig. 4). Listed below is a simplified sketch of one of the algorithms used in this experiment. It facilitates acquiring subsequent objects (subsequences originally stored in a database and converted into matrices).

ALGORITHMS

To simulate the preclustering process, the following algorithm has been used.

CENTROID EXTRACTION (subsequences; k_max; R; ratio_min)

...

PREPROCESS: put all subsequences into an N x M matrix (N is the number of subsequences, M is the dimensionality)

...

SIMULATE(Euclidean distance)

FOR k = 1 to k_max - 1

IF k = 1 THEN

centroid(k; d_k_1, d_k_2, ..., d_k_M) 'first centroid created no matter what

number_of_centroids = k

attempts = 1

ELSE

FOR kk = 1 TO k-1

attempts = attempts + 1

FOR j = 1 to M

find maximum range for centroids acquired so far (orange arrow on Figure 4)

find minimum distance D_min for current centroid candidate to all centroids generated so far

IF D_min < R (R is the merge-or-not red orb around each generated centroid, see Figure 4)

discard candidate, fuse with its nearest neighboring centroid

ELSE

centroid(k; d_k_1, d_k_2, ..., d_k_M) generate new centroid

number_of_centroids = number_of_centroids + 1

IF number_of_centroids > k_max - 1 THEN

RETURN all generated centroids and their overall C-H index

ELSE

compute ratio = number_of_centroids/attempts

IF ratio < ratio_min THEN

RETURN all generated centroids and their overall C-H index

END IF

END IF

END IF

NEXT j

NEXT kk

END IF

NEXT k

...
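The following runnable sketch reflects my reading of the CENTROID EXTRACTION pseudocode above: candidates are taken in arrival order, fused with the nearest existing centroid if they fall inside the merge-or-not radius R, and acquisition stops once k_max centroids exist or the acceptance ratio drops below ratio_min. Details such as the range tracking are simplified, and the fusion-by-averaging step is an assumption.

import numpy as np

def centroid_extraction(subsequences, k_max, R, ratio_min, dist=None):
    """Greedy selection of initial centroids (seeds) for k-means."""
    X = np.asarray(subsequences, dtype=float)
    if dist is None:
        dist = lambda a, b: np.sqrt(((a - b) ** 2).sum())    # default: Euclidean distance
    centroids = [X[0].copy()]                                # first candidate always accepted
    attempts = 1
    for candidate in X[1:]:
        attempts += 1
        d = [dist(candidate, c) for c in centroids]
        if min(d) < R:
            # inside the merge-or-not radius: fuse with the nearest centroid
            j = int(np.argmin(d))
            centroids[j] = (centroids[j] + candidate) / 2.0
        else:
            centroids.append(candidate.copy())               # accept as a new centroid
            if len(centroids) >= k_max:
                break
        if len(centroids) / attempts < ratio_min:            # acceptance ratio too low: stop
            break
    return np.vstack(centroids)

Passing a different dist function (for instance one that looks distances up in the precomputed DTW matrix) corresponds roughly to the DTW variant used in step 3 of the experimental setup.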

The CENTROID EXTRACTION procedure above used the Euclidean distance. In order to use the DTW distance, the following algorithm (following the definition of DTW) was used for the implementation:

...

COMPUTE_LEAST_GLOBAL_COST(PREDECESSOR COLUMNS)

FOR i = 1 to M (dimensionality)

currentColumn[0] = local cost at (i, 0) + global cost at (i-1, 0)

FOR j = 1 to M

currentColumn[j] = local cost at (i, j) + minimum of global costs at {(i-1, j), (i-1, j-1), (i, j-1)}

NEXT j

previousColumn = currentColumn

NEXT i

The above pseudocode was written after [13]. The implementation of the algorithm (developed further) failed to deliver reasonable results, so a publicly available DTW program (described in the Software Tools section) was used to create the DTW-based distance matrices, which were then passed as additional input data to the CENTROID EXTRACTION procedure, this time using a slightly modified SIMULATE(Euclidean distance). With this setup the results were usable.
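For completeness, a minimal dynamic-programming DTW sketch following the recurrence given earlier; this is a from-scratch illustration (squared local cost, no path constraints, no undersampling), not the external DOS program that was actually used.

import numpy as np

def dtw_distance(a, b):
    """Classic O(N*M) DTW: gamma(i, j) = d(i, j) + min of the three adjacent predecessors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    gamma = np.full((n + 1, m + 1), np.inf)
    gamma[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2                    # local cost d(i, j)
            gamma[i, j] = d + min(gamma[i - 1, j],            # vertical step
                                  gamma[i, j - 1],            # horizontal step
                                  gamma[i - 1, j - 1])        # diagonal step
    return np.sqrt(gamma[n, m])                               # total warping cost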

[pic]

Fig. 4: A simple two-dimensional visualization of the initialization process (CENTROID EXTRACTION description, p. 6). As new centroid candidates, marked as small black dots, are acquired and tested, the domain space range is monotonically non-decreasing (orange diagonals). The red shields around the black dots symbolize the “merge-or-not” tolerance area. For t = 3 a merge of C1 and C3 occurs.

SOFTWARE TOOLS

The following software tools have been used in the project:

– Weka, to run k-means and EM clustering

– SPSS, to run k-means clustering using initial seeds

– Microsoft Access, Microsoft Excel and Microsoft Visual Studio, for data storage, preprocessing, graph creation and software development

– a publicly available DTW program (executable only, runs under DOS) to compute the DTW distance

– Lyapunov Exponent Tracker (LET) to determine the starting points of time series subsequences

– ad hoc programs for data preprocessing (created using MS Visual Studio).

A brief description of each of them and how they were used follows.

SPSS (which stands for Statistical Package for the Social Sciences), being a general statistical (rather than machine learning) package, allows using initial centroid positions but does not support the EM algorithm. Thus SPSS enabled evaluation of the impact of initial centroid positions on the overall clustering results, but for the k-means algorithm only. This limitation, though apparently serious, does not seem to be decisive, because in general EM accuracy is superior to that of k-means achieved under identical circumstances.

Weka is an open-source machine learning package implemented in Java. Its clustering algorithms include SimpleKMeans (the k-means algorithm, with options for the number of clusters and the random number seed) and EM (expectation maximization, with options for the maximum number of iterations, minimum allowable standard deviation, number of clusters and random number seed). Three other clustering methods are available: Cobweb, FarthestFirst and a density-based algorithm. Of those described above, only the first two (k-means and EM) have been used for clustering the time series data in this project. It was found that Weka has a serious limitation in its clustering algorithms: it does not accept initial positions of cluster centroids computed in the initialization preprocessing (or at least such an ability has not been found). For that reason Weka has been used mainly for performance comparison between EM and k-means (run on the same data and using the same Euclidean distance measure), where the initial centroids are generated randomly (iteratively or not).

Access and Excel (both from Microsoft) have been used for initial preprocessing and storing the data, and in particular:

- indexing subsequences and points within subsequences

- applying different weights to different dimensions (consecutive data points within each subsequence)

- normalizing and rescaling data (necessary to use proprietary DTW program)

- generating graphs

The Lyapunov Exponent Tracker (LET) was designed and implemented for another project, related to mining chaotic time series, and was helpful for extracting time series subsequences by marking (chronologically) the points in the input data at which the computed parameter, i.e. the largest Lyapunov exponent, suddenly decreased, indicating a qualitative change in the behavior of the underlying dynamical system. Those points were regarded as the beginnings of the subsequences, which uniquely defined each of them since their lengths were equal and arbitrarily chosen.

The DTW program was used to compute the DTW-based distances pairwise (in spite of my request, no source code was available during the realization of the project).

Visual Studio (from Microsoft again) was used to create ad hoc simple programs to preprocess the data and to perform the initialization in terms of the number of initial cluster seeds and their positions.

DATA

As already stated, the data originate from very different sources to minimize any potential data-related experimental bias. The data have been collected from the following sources (URLs):

- ECG:

- temperature:

- BP vs. USD:

- NYSE:

- El Nino SOI:

- EEG:

Overall, a brief quantitative description of the data used in the project is given below:

|       | ECG    | temperature | BP vs. USD | NYSE   | El Nino SOI | EEG    |
| mean  | -0.24  | 48.84       | 0.61       | 199.24 | -3.28       | -0.01  |
| med.  | -0.15  | 50.00       | 0.62       | 152.27 | -2.46       | 0.00   |
| range | 4.84   | 107.00      | 0.54       | 547.51 | 136.74      | 731.00 |
| min.  | -3.79  | -15.00      | 0.41       | 51.64  | -85.72      | 335.00 |
| max.  | 1.05   | 92.00       | 0.95       | 599.15 | 51.02       | 396.00 |
| count | 14998  | 6938        | 4774       | 5055   | 4904        | 18432  |

Table 1: overall characteristics of the input time series data

The plot below visualizes the diversity of the data (only parts of three sources are visible):

[pic]

Fig. 5: Visualization of the input data and its diversity: EEG (brown), New York Stock Exchange (blue), daily temperature (violet). The data have not been normalized before plotting.

Each of the time series can be characterized as macro- or micro-scale (size of the underlying system), natural or cultural (nature- or human-generated), and financial, medical or meteorological (application area). That diversity is reflected (at least partially) in Figure 5, showing very different characteristics in terms of periodicity, range, mean etc. For a more formal data description see also Table 1 (generated in Excel).
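The statistics in Table 1 were computed in Excel; for reproducibility, an equivalent pandas sketch is shown below, assuming each input series has been loaded into a column of a DataFrame (the column naming is an assumption, not part of the original setup).

import pandas as pd

def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """Reproduce the Table 1 statistics (rows) for each input time series (columns)."""
    stats = pd.DataFrame({
        "mean": df.mean(),
        "med.": df.median(),
        "range": df.max() - df.min(),
        "min.": df.min(),
        "max.": df.max(),
        "count": df.count(),
    })
    return stats.T      # statistics as rows, series as columns, as in Table 1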

EXPERIMENTAL RESULTS

EUCLIDEAN VS. DTW SIMILARITY MEASURES

Generating DTW-based distances between the time series subsequences gave in each case smaller values than the corresponding distances computed using the Euclidean measure (a detailed numeric comparison of both metrics is given in Table 2 in the Appendix). That part of the results was not surprising, since by definition DTW minimizes the distance. What was surprising was the extent of that effect and how it influenced the initialization phase (called “preclustering” in this project). Even though the “DTW effect” evidently depends on the (shape of the) underlying time series, its dramatically shorter distances (about 50% on average) resulted in a smaller number of initial cluster centroids (seeds) generated during initialization. This consequence can be interpreted in two ways. Minimizing the number of clusters is, in general, desired (unless it forces very different objects into the same partitions), although at the preclustering phase it is hard to evaluate in terms of accuracy, so the whole clustering – based on the initial seed positions – was used for the evaluation. On the other hand, the decreased distance can also be evaluated outside the context of clustering, simply by visually comparing the shapes of the objects (subsequences).

It is impressive to see how “smartly” DTW works, especially when a phase shift rather than a shape difference occurs between two subsequences (see Figure 6 for an illustration).

[pic]

Fig. 6: Two subsequences of very similar shape but “phase-shifted”; their Euclidean distance is 67, whereas the DTW distance is only 48 (coordinates omitted as not pertaining to the real dimensionality). The gray area between the curves symbolizes the dissimilarity (distance) between them.

Even though it is claimed that DTW is much less error prone than the Euclidean distance, one should still consider whether the DTW-based approach is always justified. DTW was designed for voice recognition, where different person-specific speaking rates in fact translate into the same “phonic-semantic” values (e.g. “locomoootive” vs. “looocomotive”). One possible justification for using DTW with the time series data of this experiment is the following reasoning: the phenomena considered here (financial, meteorological or medical) have a very complex underlying structure and are determined by many factors, independent or related. The system's response to some external factor is thus expressed through all that complexity, which might result in some delays. For instance, investors' reactions might be delayed by communication problems, a change of temperature by recent rain (more water absorbs the heat), etc. Nevertheless, the general rules hold and the expected effect “usually” takes place, even though it is delayed. As of now, this (simplified) argumentation is what justifies using DTW. Such an approach is also strongly advocated (possibly for other reasons) by Keogh [14].

Any observable effect of DTW-based distance evaluation on clustering accuracy can be explained by its tendency to minimize the number of clusters during the centroid optimization phase, which in turn results from the smaller pairwise distances computed between pairs of (already generated) centroids. For correctness, however, it should be added that during the experiment it apparently never happened that – with the parameters chosen – the resulting number of centroids was so small that the real clustering (performed in SPSS) forced objects into clusters they should not belong to.

ROLE OF WEIGHTING

There was no measurable effect of applying the weights to the “beginning” points of each subsequence, i.e. favoring points located closer to the starting point of their respective subsequence. That should not be surprising, because – in this phase of the experiment – the only effect the weights had was deforming the shapes of the subsequences, identically for all of them. The effect of the weights might, however, become visible when paired with the LET algorithm, since – by the definition of its functionality and the way the cut-off points are generated – its reaction is spurred precisely by the first points (put another way: what is an effect for the LET is a cause for the clustering initialization).

DIMENSIONALITY

As it turned out, increasing the dimensionality (the number of sample points in each subsequence) resulted in a decrease in the number of initial centroids. This could be explained by the pairwise similarities between subsequences decreasing as their respective lengths (and thus, by definition, dimensionalities) increased. That resulted in greater dissimilarity and a tendency towards fewer centroids generated during initialization (the same effect as with DTW, though due to a different cause). Table 3 shows these results as increased accuracy.

Discussing dimensionality requires at least one more comment. In classification, accuracy usually grows as dimensionality increases (more information is provided for analysis); at some point this trend usually either saturates (more information via dimensionality does not contribute any discriminative information) or even hurts accuracy (an overfitting-like effect acquired during training adversely affects classification during testing). For unsupervised methods like partitioning-based clustering, dimensionality was tested here for only two values (30 and 100), so it might be risky to draw far-reaching conclusions, but the experimental results largely confirm what had been expected, i.e. that higher dimensionality is more discriminative in evaluating similarity.

DATA BIAS

It has been observed that the method of optimizing initial centroids delivers better results for time series exhibiting some noticeable periodicity, like temperature or SOI. A possible explanation is that during the centroid optimization phase (i.e. what the CENTROID EXTRACTION algorithm does), the first k objects (subsequences) are more “representative” of the whole time series for periodic data than for aperiodic data such as stock exchange (NYSE) or currency exchange rates. For the EEG data the centroid optimization effect was in the middle, which should not be surprising since the signal consists of two regions: quasiperiodic (seizure-free) and totally aperiodic (representing the seizure). The corresponding effects of centroid optimization for those data sets are given in Table 4.

The above observation fits well the experimental scenario called “data bias” by E. Keogh, who gives the following definition: “Data bias is the conscious or unconscious use of a particular set of testing data to confirm a desired finding” [15]. To minimize this effect, Keogh recommends using data from sources as different as possible. This paradigm has been followed in the project.

FINAL ACCURACY EVALUATION

Evaluation of clustering is, in general, an open question – as opposed to supervised learning techniques, where unique labeling is present. One can qualitatively evaluate the similarity between clustered objects (e.g. time series subsequences) just by looking at the results, i.e. creating plots out of the final centroids and comparing the curve shapes with the shapes of the extracted and clustered subsequences. Such an approach is enabled by the SPSS output, and the visual evaluation has been done as follows:

- for each centroid, collect its all coordinates (very easy in SPSS)

- create k*D matrix (where k is number of clusters and D is dimensionality)

- make multiple D-plot (in Excel).

In this way one can observe whether the resulting centroid curves, even though not “existing” in the input data, resemble subsequences of the whole time series well, and whether they exhaust all detectable “shapes”. The latter can also be easily plotted using the cut-off indices and dimensionalities (30 or 100 points), though this was very time consuming. The approach is illustrated in Figure 7 in the Appendix.
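A short sketch of this visual check (Excel was used in the project; the matplotlib version below is an equivalent illustration and assumes the k x D matrix of final centroid coordinates exported from SPSS is already loaded):

import numpy as np
import matplotlib.pyplot as plt

def plot_centroid_curves(centers):
    """centers: (k, D) array of final centroid coordinates exported from SPSS."""
    for idx, c in enumerate(np.asarray(centers, dtype=float)):
        plt.plot(c, label=f"cluster {idx + 1}")   # each centroid drawn as a D-point curve
    plt.xlabel("point within subsequence")
    plt.ylabel("normalized value")
    plt.legend()
    plt.title("Final cluster centroids as curves")
    plt.show()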

A probably better and more objective way of evaluating the impact of initialization on the time series clustering error rate is what was attempted in this project for the EEG and stock exchange data sets, namely:

- labeling the subsequences manually

- applying k-means while ignoring the labels

- evaluating the agreement between the human- and machine-based grouping

- based on the above, computing accuracy in terms of error rate.

The results can be found in Table 5.
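One common way to carry out the last two steps is to map each cluster to its majority manual label and count disagreements; the exact matching used in the project is not specified, so the sketch below (which assumes the manual labels are coded as non-negative integers) is only one possible convention.

import numpy as np

def clustering_error_rate(manual_labels, cluster_ids):
    """Map every cluster to its majority manual label, then count disagreements."""
    manual = np.asarray(manual_labels)
    clusters = np.asarray(cluster_ids)
    errors = 0
    for c in np.unique(clusters):
        members = manual[clusters == c]
        majority = np.bincount(members).argmax()     # dominant manual label in cluster c
        errors += int(np.sum(members != majority))
    return errors / len(manual)                      # error rate; accuracy = 1 - error rate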

Surprisingly, initialization improved the accuracy by about 20% for the stock market data, which were very irregular, while increasing the error rate by about 37% for the EEG data. A possible explanation for the deteriorating accuracy on EEG is that – as already noted earlier – the first k_max (or even fewer) centroids captured in the initialization phase might not be sufficiently representative of the full richness of the EEG signal, which exhibits totally different behavior in its middle part. That part had not yet been reached when the centroid candidates were captured!

SUMMARY

Two main challenges have been undertaken in this project: optimizing k-means clustering via optimizing number and positions of initial centroids and doing so for evolving time series data.

It is known that, in general, clustering is an NP-hard problem, and for that reason most partitioning algorithms choose the initial centroid positions randomly, as well as their number. Things get even more complicated if the number of objects to be clustered is unknown, as is the case for evolving time series. Even though the experimental reality did not allow for “evolving infinite time series”, its simulation helped to experience at least some of the difficulties and to draw the following conclusions:

- regarding real time processing:

o data structure for faster retrieval should be used

o on-the-fly data undersampling (to facilitate fast DTW)

o updating centroid positions as the time series evolves, thus adjusting to possible dramatic changes in the signal behavior

- regarding clustering algorithms to be used:

o shifting attention to probabilistic algorithms (e.g. EM)

o implement own DTW algorithm (with possible modifications taking into account well defined constraints, e.g. path width)

- regarding data: experiment with even more data from totally different domains.

Finally, more automation of the customized data preprocessing will definitely be needed, as this phase turned out to be by far the most time-consuming part of the project.

FUTURE WORKS

A general comparison of the performance of k-means and Expectation-Maximization (EM) should be carried out, with the conjecture that EM should perform at least as well as k-means under the same conditions (with and without initialization, using Euclidean and DTW during initialization, the same input data, etc.). Shifting the focus to EM can be justified by the “unpredictable” nature of evolving time series, where both the number of objects and any information about newly arriving subsequences are totally unknown. In such a scenario a probabilistic approach seems more appropriate (EM is a probabilistic generalization of k-means in that it assumes probabilities of membership in particular partitions, breaking the exclusivity of k-means).

Implementing a functional, dynamic programming-based algorithm for computing the DTW distance should facilitate more realistic conditions for testing the real-time processing requirement, currently handled in an off-line fashion.

Finally, a data structure for efficient storage and fast retrieval of time series subsequence data should be designed and implemented [16]. That might help recursively optimize the centroids, i.e. “update” the available knowledge about the evolving subsequences and prevent the adverse effect described for the EEG data.

REFERENCES

[1] C. Domeniconi et al.: "A Classification Approach for Prediction of Target Events in Temporal Sequences", in Proceedings of the 6th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), 2002

[2] Keogh et al.: "Clustering of Time Series Subsequences is Meaningless: Implications for Previous and Future Research"

[3] Gunopulos et al.: "Finding Similar Time Series"

[4] Perng et al.: "Landmarks: A New Model for Similarity-Based Pattern Querying in Time Series Databases"

[5] Clarkson et al.: "Unsupervised Clustering of Ambulatory Audio and Video", in ICASSP '99

[6] Vlachos et al.: "Indexing Multi-Dimensional Time-Series with Support for Multiple Distance Measures", SIGKDD '03

[7] Aggarwal et al.: "A Framework for Clustering Evolving Data Streams"

[8] Daewon et al.: "Pseudo PCA for Cluster Analysis of Time Series Data"

[9] Calinski et al.: Communications in Statistics (1974)

[10] Legendre: "Program k-means User's Guide", Universite de Montreal

[11] Ebdon: class notes (from the Internet)

[12] MacQueen: "Some Methods for Classification and Analysis of Multivariate Observations", Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, University of California Press, 1:281-297 (1967)

[13] Wrigley: "Speech Recognition by Dynamic Time Warping" (1999)

[14] Keogh et al.: "Indexing Multi-Dimensional Time-Series with Support for Multiple Distance Measures", SIGKDD 2003

[15] Keogh: "A Tutorial on Indexing and Mining Time Series Data", The XVII Brazilian Symposium on Databases, Gramado, Rio Grande do Sul, Brazil, 2002

[16] Barbara: "Requirements for Clustering Data Streams"

[17] Han et al.: "Data Mining: Concepts and Techniques", Morgan Kaufmann, 2001

[18] Wang et al.: "Clustering by Pattern Similarity in Large Data Sets"

[19] Kahveci et al.: "Similarity Searching for Multi-attribute Sequences"

[20] Dunn (1973): "A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters", Journal of Cybernetics 3: 32-57

[21] Bezdek (1981): "Pattern Recognition with Fuzzy Objective Function Algorithms", Plenum Press, New York

and many Internet web sites.

APPENDIX

EXPERIMENTAL RESULTS – TABLES

|                 | EKG | Δ  | temperature | Δ  | NYSE | Δ  | currency | Δ  | SOI | Δ  | EEG | Δ   |
| not initialized | 78  |    | 82          |    | 57   |    | 66       |    | 79  |    | 73  |     |
| initialized:    |     |    |             |    |      |    |          |    |     |    |     |     |
|   Euclidean     | 81  | 3  | 89          | 9  | 69   | 21 | 70       | 6  | 84  | 6  | 53  | -27 |
|   DTW           | 89  | 14 | 94          | 15 | 78   | 37 | 82       | 14 | 93  | 18 | 60  | -18 |

Table 4: Percentage increase or decrease (Δ) effects of centroid optimization on accuracy (defined via error rate)

EXPERIMENTAL RESULTS – GRAPHS

[pic]

Fig. 7: Using the SPSS centroid positions it was possible to perform visual and numeric evaluation of how accurate the clustering was. The above example depicts the temperature time series, clustered using k-means with optimized k = 5 (DTW).

REMARKS

Chaos and the Lyapunov exponent. The Lyapunov exponent is a characteristic of dynamical systems which quantifies how fast (initially) close points - the nearest neighbors (NN) - diverge (or converge) while moving along their trajectories in the phase space of the system. Put another way, the LE is a measure of how sensitive the system is to initial conditions and, as a result, of how far into the future one can predict its behavior: when LE > 0 (LE < 0), the NN diverge (converge), and with LE = 0 there is no change (as observed for ideally periodic time series, e.g. a sinusoid). For a system to be chaotic the LE must be positive (though other conditions must be fulfilled as well).

Phase space reconstruction, aimed at “approximating” the original phase space by means of the available observable time series data, is achieved by applying the method of time delays, which is based theoretically on the works of Takens and Mané. The time series data is often called a read-out variable x(t), whose values are captured at a sampling interval Δt. Defining tau as the time delay and dim as the embedding dimension, one can (re-)construct the vector-valued trajectory in the underlying phase space:

y(t) = [y1(t), y2(t), …, ydim(t)]

where y1(t) = x(t), y2(t) = x(t + tau·Δt), y3(t) = x(t + 2·tau·Δt), … are the coordinates of the vectors in the reconstructed phase space.
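A minimal sketch of this time-delay embedding, with the delay given in samples (i.e. tau already multiplied by the sampling interval):

import numpy as np

def delay_embed(x, dim, tau):
    """Reconstruct trajectory vectors y(t) = [x(t), x(t+tau), ..., x(t+(dim-1)*tau)]."""
    x = np.asarray(x, dtype=float)
    n = len(x) - (dim - 1) * tau                    # number of reconstructed vectors
    return np.column_stack([x[i * tau: i * tau + n] for i in range(dim)])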
