Answer all questions in the homework document with no plagiarism.

Homework 8

Answer the following questions: (10 points each)



1. Name at least one situation in which you would not want to use clustering based on SNN similarity or density.

2. Explain the difference between likelihood and probability.

3. Discuss the advantages and disadvantages of treating clustering as an optimization problem. Among other factors, consider efficiency, non-determinism, and whether an optimization-based approach captures all types of clusterings that are of interest.

4. Traditional K-means has a number of limitations, such as sensitivity to outliers and difficulty in handling clusters of different sizes and densities, or with non-globular shapes. Comment on the ability of fuzzy c-means to handle these situations.

5. Table 8.1 lists the two nearest neighbors of four points. Calculate the SNN similarity between each pair of points using the definition of SNN similarity given in Algorithm 8.10, and present your results as an SNN similarity matrix.


Data Mining
Cluster Analysis: Advanced Concepts and Algorithms

Lecture Notes for Chapter 8

Introduction to Data Mining, 2nd Edition

by

Tan, Steinbach, Karpatne, Kumar


Outline
Prototype-based
Fuzzy c-means
Mixture Model Clustering
Self-Organizing Maps
Density-based
Grid-based clustering
Subspace clustering
Graph-based
Chameleon
Jarvis-Patrick
Shared Nearest Neighbor (SNN)
Characteristics of Clustering Algorithms

Hard (Crisp) vs Soft (Fuzzy) Clustering
For soft clustering, allow a point to belong to more than one cluster
For K-means, generalize the objective function
wij: weight with which object xi belongs to cluster cj
To minimize SSE, repeat the following steps:
Fix cj and determine wij (cluster assignment)
Fix wij and recompute cj
Hard clustering: wij ∈ {0, 1}

Soft (Fuzzy) Clustering: Estimating Weights

SSE(x) is minimized when wx1 = 1, wx2 = 0

(Figure: number line with centroid c1 at 1, point x at 2, and centroid c2 at 5.)
Fuzzy C-means
Objective function

wij: weight with which object xi belongs to cluster cj
p: a power for the weight, not a superscript, which controls how “fuzzy” the clustering is
To minimize the objective function, repeat the following:
Fix cj and determine wij
Fix wij and recompute cj
Fuzzy c-means clustering: wij ∈ [0, 1]
Bezdek, James C. Pattern recognition with fuzzy objective function algorithms. Kluwer Academic Publishers, 1981.

p: fuzzifier (p > 1)

Fuzzy C-means

SSE(x) is minimized when wx1 = 0.9, wx2 = 0.1

(Figure: number line with centroid c1 at 1, point x at 2, and centroid c2 at 5, together with a plot of SSE(x) as a function of the weights.)
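A quick worked check of these weights, assuming p = 2 and the squared distances 1 and 9 from the figure above:

$$\min_{w_{x1}+w_{x2}=1}\; w_{x1}^{2}\cdot 1 + w_{x2}^{2}\cdot 9 \;\Rightarrow\; \frac{d}{dw_{x1}}\!\left[w_{x1}^{2} + 9(1-w_{x1})^{2}\right] = 0 \;\Rightarrow\; 20\,w_{x1} = 18 \;\Rightarrow\; w_{x1} = 0.9,\; w_{x2} = 0.1.$$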

Fuzzy C-means
Objective function:

Initialization: choose the weights wij randomly
Repeat:
Update centroids:
Update weights:

p: fuzzifier (p > 1)
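A minimal NumPy sketch of this loop, using the standard fuzzy c-means updates (centroids as means weighted by w^p; weights proportional to (1/dist^2)^(1/(p-1)) and normalized per point). The function and variable names are mine, not from the slides:

import numpy as np

def fuzzy_c_means(X, k, p=2.0, n_iter=100, seed=0):
    """Minimal fuzzy c-means sketch: X is an (m, d) array, k the number of clusters, p > 1 the fuzzifier."""
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    # Initialization: choose the weights w_ij randomly, each row summing to 1
    W = rng.random((m, k))
    W /= W.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # Update centroids: weighted means using w_ij^p
        Wp = W ** p
        C = (Wp.T @ X) / Wp.sum(axis=0)[:, None]
        # Update weights: w_ij proportional to (1 / dist(x_i, c_j)^2)^(1/(p-1)), normalized per point
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2) + 1e-12
        W = (1.0 / d2) ** (1.0 / (p - 1.0))
        W /= W.sum(axis=1, keepdims=True)
    return C, W

With p close to 1 the weights approach hard (0/1) assignments; larger p makes the clustering fuzzier.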

Fuzzy K-means Applied to Sample Data

An Example Application: Image Segmentation
Modified versions of fuzzy c-means have been used for image segmentation
Especially fMRI images (functional magnetic resonance images)
References
Gong, Maoguo, Yan Liang, Jiao Shi, Wenping Ma, and Jingjing Ma. “Fuzzy c-means clustering with local information and kernel metric for image segmentation.” Image Processing, IEEE Transactions on 22, no. 2 (2013): 573-584.

From left to right: original images, fuzzy c-means, EM, BCFCM
Ahmed, Mohamed N., Sameh M. Yamany, Nevin Mohamed, Aly A. Farag, and Thomas Moriarty. “A modified fuzzy c-means algorithm for bias field estimation and segmentation of MRI data.” Medical Imaging, IEEE Transactions on 21, no. 3 (2002): 193-199.

Hard (Crisp) vs Soft (Probabilistic) Clustering
Idea is to model the set of data points as arising from a mixture of distributions
Typically, normal (Gaussian) distribution is used
But other distributions have been very profitably used
Clusters are found by estimating the parameters of the statistical distributions
Can use a k-means like algorithm, called the Expectation-Maximization (EM) algorithm, to estimate these parameters
Actually, k-means is a special case of this approach
Provides a compact representation of clusters
The probabilities with which a point belongs to each cluster provide functionality similar to fuzzy clustering.

Probabilistic Clustering: Example
Informal example: consider modeling the points that generate the following histogram.
Looks like a combination of two normal (Gaussian) distributions
Suppose we can estimate the mean and standard deviation of each normal distribution.
This completely describes the two clusters
We can compute the probabilities with which each point belongs to each cluster
Can assign each point to the cluster (distribution) for which it is most probable.

Probabilistic Clustering: EM Algorithm
Initialize the parameters
Repeat
For each point, compute its probability under each distribution
Using these probabilities, update the parameters of each distribution
Until there is no change
Very similar to K-means
Consists of assignment and update steps
Can use random initialization
Problem of local minima
For normal distributions, typically use K-means to initialize
If using normal distributions, can find elliptical as well as spherical shapes.
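A compact sketch of these two steps for a one-dimensional mixture of two Gaussians (initialization and the stopping test are deliberately simplified; the helper name em_gmm_1d is mine):

import numpy as np
from scipy.stats import norm

def em_gmm_1d(x, n_iter=50):
    """EM for a two-component 1-D Gaussian mixture: x is a 1-D array of points."""
    # Initialize the parameters (means, standard deviations, mixture weights)
    mu = np.array([x.min(), x.max()], dtype=float)
    sigma = np.array([x.std(), x.std()]) + 1e-6
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E step: for each point, compute its probability under each distribution (responsibilities)
        dens = np.stack([pi[j] * norm.pdf(x, mu[j], sigma[j]) for j in range(2)], axis=1)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M step: using these probabilities, update the parameters of each distribution
        nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk) + 1e-6
        pi = nk / len(x)
    return mu, sigma, pi

In practice, K-means is often used to initialize the means, as noted above.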

Probabilistic Clustering: Updating Centroids
Update formula for weights assuming an estimate for statistical parameters

Very similar to the fuzzy k-means formula
Weights are probabilities
Weights are not raised to a power
Probabilities calculated using Bayes rule:
Need to assign weights to each cluster
Weights may not be equal
Similar to prior probabilities
Can be estimated:

More Detailed EM Algorithm

Probabilistic Clustering Applied to Sample Data

Probabilistic Clustering: Dense and Sparse Clusters


Problems with EM
Convergence can be slow
Only guarantees finding local maxima
Makes some significant statistical assumptions
Number of parameters for the Gaussian distribution grows as O(d^2), where d is the number of dimensions
Parameters associated with covariance matrix
K-means only estimates cluster means, which grow as O(d)

Alternatives to EM
Method of moments / Spectral methods
ICML 2014 workshop bibliography
https://sites.google.com/site/momentsicml2014/bibliography
Markov chain Monte Carlo (MCMC)
Other approaches

SOM: Self-Organizing Maps
Self-organizing maps (SOM)
Centroid-based clustering scheme
Like K-means, a fixed number of clusters are specified
However, the spatial relationship of clusters is also specified, typically as a grid
Points are considered one by one
Each point is assigned to the closest centroid
Other centroids are updated based on their nearness to the closest centroid
Kohonen, Teuvo. Self-Organizing Maps. Springer Series in Information Sciences, vol. 30. Springer, 1995.

SOM: Self-Organizing Maps

Updates are weighted by distance
Centroids farther away are affected less
The impact of the updates decreases with time
At some point the centroids will not change much
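A bare-bones sketch of this training loop for centroids arranged on a rectangular grid. The Gaussian neighborhood function and the linear decay of the learning rate and neighborhood width are common choices, not prescribed by the slides:

import numpy as np

def som_train(X, grid=(5, 5), n_steps=5000, lr0=0.5, sigma0=2.0, seed=0):
    """Train a SOM: centroids live on a grid; each step updates the winner and its grid neighbors."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    # Grid coordinates of each centroid and random initial centroid positions
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    centroids = X[rng.integers(0, len(X), rows * cols)].astype(float)
    for t in range(n_steps):
        lr = lr0 * (1.0 - t / n_steps)            # update strength decreases over time
        sigma = sigma0 * (1.0 - t / n_steps) + 0.5
        x = X[rng.integers(0, len(X))]            # points are considered one by one
        winner = np.argmin(((centroids - x) ** 2).sum(axis=1))
        # Centroids farther away on the grid are affected less
        grid_d2 = ((coords - coords[winner]) ** 2).sum(axis=1)
        h = np.exp(-grid_d2 / (2.0 * sigma ** 2))
        centroids += lr * h[:, None] * (x - centroids)
    return centroids.reshape(rows, cols, -1)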

SOM: Self-Organizing Maps
SOM can be viewed as a type of dimensionality reduction
If a two-dimensional grid is used, the results can be visualized

SOM Clusters of LA Times Document Data

Another SOM Example: 2D Points

Issues with SOM
Computational complexity
Locally optimal solution
Grid is somewhat arbitrary

Grid-based Clustering
A type of density-based clustering

Subspace Clustering
Until now, we found clusters by considering all of the attributes
Some clusters may involve only a subset of attributes, i.e., subspaces of the data
Example:
When k-means is used to find document clusters, the resulting clusters can typically be characterized by 10 or so terms

Example
Clique Algorithm – Overview
A grid-based clustering algorithm that methodically finds subspace clusters
Partitions the data space into rectangular units of equal volume
Measures the density of each unit by the fraction of points it contains
A unit is dense if the fraction of overall points it contains is above a user-specified threshold
A cluster is a group of contiguous (touching) dense units

Clique Algorithm
It is impractical to check each volume unit to see if it is dense, since there is an exponential number of such units
Monotone property of density-based clusters:
If a set of points forms a density based cluster in k dimensions, then the same set of points is also part of a density based cluster in all possible subsets of those dimensions
Very similar to Apriori algorithm
Can find overlapping clusters
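A simplified sketch of the bottom-up search for dense units (n_bins, xi, and max_dim are illustrative parameters of mine; candidate units are built from dense one-dimensional intervals, a weaker but valid form of the Apriori-style pruning, and the final step of joining contiguous dense units into clusters is left out):

import numpy as np
from itertools import combinations, product

def clique_dense_units(X, n_bins=10, xi=0.02, max_dim=2):
    """Simplified CLIQUE-style search for dense grid units in subspaces (sketch, not optimized)."""
    m, d = X.shape
    # Partition the data space into equal-width intervals along each attribute
    width = (X.max(axis=0) - X.min(axis=0)) + 1e-12
    cells = np.minimum(((X - X.min(axis=0)) / width * n_bins).astype(int), n_bins - 1)

    # 1-D dense units: intervals holding at least a fraction xi of all points
    dense_1d = {j: {c for c in range(n_bins) if (cells[:, j] == c).sum() / m >= xi}
                for j in range(d)}
    dense_units = {(j,): {(c,) for c in dense_1d[j]} for j in range(d)}

    # Monotone property: a dense unit in k dimensions must project onto dense units in fewer
    # dimensions, so candidates are formed only from dense 1-D intervals
    for k in range(2, max_dim + 1):
        for dims in combinations(range(d), k):
            found = set()
            for cell in product(*(dense_1d[j] for j in dims)):
                inside = np.all(cells[:, list(dims)] == np.array(cell), axis=1)
                if inside.sum() / m >= xi:
                    found.add(cell)
            if found:
                dense_units[dims] = found
    # Clusters would then be the connected groups of contiguous dense units in each subspace
    return dense_units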

Clique Algorithm

Limitations of Clique
Time complexity is exponential in number of dimensions
Especially if “too many” dense units are generated at lower stages
May fail if clusters are of widely differing densities, since the threshold is fixed
Determining appropriate threshold and unit interval length can be challenging

Denclue (DENsity CLUstering)
Based on the notion of kernel-density estimation
Contribution of each point to the density is given by an influence or kernel function

Overall density is the sum of the contributions of all points

Formula and plot of Gaussian Kernel
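The Gaussian kernel mentioned above is usually written as follows, with σ controlling the width of each point's influence:

$$K(\mathbf{y}, \mathbf{x}_i) = e^{-\|\mathbf{y}-\mathbf{x}_i\|^2 / (2\sigma^2)}, \qquad \mathrm{density}(\mathbf{y}) = \sum_{i=1}^{m} K(\mathbf{y}, \mathbf{x}_i).$$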

Example of Density from Gaussian Kernel

DENCLUE Algorithm

DENCLUE Algorithm
Find the density function
Identify local maxima (density attractors)
Assign each point to a density attractor
Follow the direction of maximum increase in density (hill climbing)
Define clusters as groups of points associated with the same density attractor
Discard clusters whose density attractor has a density less than a user-specified minimum
Combine clusters that are connected by a path of points whose density is above that minimum
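A toy sketch of the kernel density function and of the hill-climbing (gradient ascent) step that carries a point to its density attractor; the step size, tolerance, and σ are arbitrary illustrative values:

import numpy as np

def kernel_density(y, X, sigma=1.0):
    """Sum of Gaussian kernel contributions of all data points X at location y."""
    return np.exp(-((X - y) ** 2).sum(axis=1) / (2 * sigma ** 2)).sum()

def density_attractor(x, X, sigma=1.0, step=0.1, tol=1e-4, max_iter=200):
    """Follow the direction of maximum density increase (gradient ascent) starting from point x."""
    y = x.astype(float).copy()
    for _ in range(max_iter):
        w = np.exp(-((X - y) ** 2).sum(axis=1) / (2 * sigma ** 2))
        grad = (w[:, None] * (X - y)).sum(axis=0) / sigma ** 2   # gradient of the kernel density
        if np.linalg.norm(grad) < tol:
            break
        y += step * grad
    return y  # approximate density attractor (local maximum of the density)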

Graph-Based Clustering: General Concepts
Graph-Based clustering uses the proximity graph
Start with the proximity matrix
Consider each point as a node in a graph
Each edge between two nodes has a weight which is the proximity between the two points
Initially the proximity graph is fully connected
MIN (single-link) and MAX (complete-link) can be viewed in graph terms
In the simplest case, clusters are connected components in the graph.

CURE Algorithm: Graph-Based Clustering
Agglomerative hierarchical clustering algorithms vary in terms of how the proximity of two clusters is computed
MIN (single link)
susceptible to noise/outliers
MAX (complete link)/GROUP AVERAGE/Centroid/Ward’s:
may not work well with non-globular clusters
CURE algorithm tries to handle both problems


CURE Algorithm
Represents a cluster using multiple representative points
Representative points are found by selecting a constant number of points from a cluster
The first representative point is chosen to be the point furthest from the center of the cluster
Remaining representative points are chosen so that they are farthest from all previously chosen points

CURE Algorithm
“Shrink” representative points toward the center of the cluster by a factor, α
Shrinking representative points toward the center helps avoid problems with noise and outliers
Cluster similarity is the similarity of the closest pair of representative points from different clusters
CURE Algorithm
Uses an agglomerative hierarchical scheme to perform clustering:
α = 0: similar to centroid-based
α = 1: somewhat similar to single-link
CURE is better able to handle clusters of arbitrary shapes and sizes
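A small sketch of how the representative points of a single cluster could be chosen and shrunk (n_rep and alpha are example values; the surrounding agglomerative merging loop is omitted):

import numpy as np

def cure_representatives(cluster_points, n_rep=5, alpha=0.2):
    """Pick well-scattered representative points for a cluster, then shrink them toward its center."""
    center = cluster_points.mean(axis=0)
    # First representative: the point farthest from the cluster center
    reps = [cluster_points[np.argmax(((cluster_points - center) ** 2).sum(axis=1))]]
    # Remaining representatives: each is farthest from all previously chosen representatives
    while len(reps) < min(n_rep, len(cluster_points)):
        d = np.min([((cluster_points - r) ** 2).sum(axis=1) for r in reps], axis=0)
        reps.append(cluster_points[np.argmax(d)])
    reps = np.array(reps)
    # Shrink toward the center by a factor alpha to reduce the effect of noise and outliers
    return reps + alpha * (center - reps)

Cluster proximity would then be measured between the shrunken representative points of different clusters, as described above.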

Experimental Results: CURE
Picture from CURE, Guha, Rastogi, Shim.

Experimental Results: CURE

Picture from CURE, Guha, Rastogi, Shim.
(centroid)
(single link)

CURE Cannot Handle Differing Densities
Original Points

CURE

Graph-Based Clustering: Chameleon
Based on several key ideas
Sparsification of the proximity graph
Partitioning the data into clusters that are relatively pure subclusters of the “true” clusters
Merging based on preserving characteristics of clusters

Graph-Based Clustering: Sparsification
The amount of data that needs to be processed is drastically reduced
Sparsification can eliminate more than 99% of the entries in a proximity matrix
The amount of time required to cluster the data is drastically reduced
The size of the problems that can be handled is increased

Graph-Based Clustering: Sparsification …
Clustering may work better
Sparsification techniques keep the connections to the most similar (nearest) neighbors of a point while breaking the connections to less similar points.
The nearest neighbors of a point tend to belong to the same class as the point itself.
This reduces the impact of noise and outliers and sharpens the distinction between clusters.
Sparsification facilitates the use of graph partitioning algorithms (or algorithms based on graph partitioning algorithms)
Chameleon and Hypergraph-based Clustering

Sparsification in the Clustering Process

Limitations of Current Merging Schemes
Existing merging schemes in hierarchical clustering algorithms are static in nature
MIN or CURE:
Merge two clusters based on their closeness (or minimum distance)
GROUP-AVERAGE:
Merge two clusters based on their average connectivity

Limitations of Current Merging Schemes

Closeness schemes will merge (a) and (b)
Average connectivity schemes will merge (c) and (d)

Chameleon: Clustering Using Dynamic Modeling
Adapt to the characteristics of the data set to find the natural clusters
Use a dynamic model to measure the similarity between clusters
Main properties are the relative closeness and relative inter-connectivity of the cluster
Two clusters are combined if the resulting cluster shares certain properties with the constituent clusters
The merging scheme preserves self-similarity

Relative Interconnectivity

Relative Closeness

Chameleon: Steps
Preprocessing Step:
Represent the data by a Graph
Given a set of points, construct the k-nearest-neighbor (k-NN) graph to capture the relationship between a point and its k nearest neighbors
Concept of neighborhood is captured dynamically (even if region is sparse)
Phase 1: Use a multilevel graph partitioning algorithm on the graph to find a large number of clusters of well-connected vertices
Each cluster should contain mostly points from one “true” cluster, i.e., be a sub-cluster of a “real” cluster

Chameleon: Steps …
Phase 2: Use Hierarchical Agglomerative Clustering to merge sub-clusters
Two clusters are combined if the resulting cluster shares certain properties with the constituent clusters
Two key properties used to model cluster similarity:
Relative Interconnectivity: Absolute interconnectivity of two clusters normalized by the internal connectivity of the clusters
Relative Closeness: Absolute closeness of two clusters normalized by the internal closeness of the clusters

Experimental Results: CHAMELEON

Experimental Results: CURE (10 clusters)

Experimental Results: CURE (15 clusters)

Experimental Results: CHAMELEON

Experimental Results: CURE (9 clusters)

Experimental Results: CURE (15 clusters)

Experimental Results: CHAMELEON


Graph-Based Clustering: SNN Approach
Shared Nearest Neighbor (SNN) graph: the weight of an edge is the number of shared neighbors between the vertices, given that the vertices are connected
(Figure: two vertices i and j connected by an edge of weight 4, the number of nearest neighbors they share.)

Creating the SNN Graph
Sparse Graph
Link weights are similarities between neighboring points

Shared Near Neighbor Graph
Link weights are number of Shared Nearest Neighbors

Jarvis-Patrick Clustering
First, the k-nearest neighbors of all points are found
In graph terms this can be regarded as breaking all but the k strongest links from a point to other points in the proximity graph

A pair of points is put in the same cluster if
the two points share more than T neighbors, and
the two points are in each other's k-nearest-neighbor lists
For instance, we might choose a nearest-neighbor list of size 20 and put points in the same cluster if they share more than 10 near neighbors
Jarvis-Patrick clustering is too brittle
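A compact sketch of Jarvis-Patrick clustering along these lines, using scikit-learn for the k-nearest-neighbor search and SciPy for the connected components (k and T play the roles described above):

import numpy as np
from sklearn.neighbors import NearestNeighbors
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def jarvis_patrick(X, k=20, T=10):
    """Link points i and j if they are in each other's k-NN lists and share more than T neighbors."""
    m = len(X)
    # k nearest neighbors of every point (first column of the result is the point itself)
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    neighbors = [set(row[1:]) for row in idx]
    rows, cols = [], []
    for i in range(m):
        for j in neighbors[i]:
            if i in neighbors[j] and len(neighbors[i] & neighbors[j]) > T:
                rows.append(i)
                cols.append(j)
    adj = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(m, m))
    # Clusters are the connected components of the resulting graph
    _, labels = connected_components(adj, directed=False)
    return labels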

When Jarvis-Patrick Works Reasonably Well
Original Points

Jarvis Patrick Clustering
6 shared neighbors out of 20

When Jarvis-Patrick Does NOT Work Well
Smallest threshold, T, that does not merge clusters
Threshold of T – 1

SNN Density-Based Clustering
Combines:
Graph based clustering (similarity definition based on number of shared nearest neighbors)
Density based clustering (DBSCAN-like approach)
SNN density measures whether a point is surrounded by similar points (with respect to its nearest neighbors)

SNN Clustering Algorithm
Compute the similarity matrix
This corresponds to a similarity graph with data points for nodes and edges whose weights are the similarities between data points
Sparsify the similarity matrix by keeping only the k most similar neighbors
This corresponds to only keeping the k strongest links of the similarity graph
Construct the shared nearest neighbor graph from the sparsified similarity matrix.
At this point, we could apply a similarity threshold and find the connected components to obtain the clusters (Jarvis-Patrick algorithm)
Find the SNN density of each point
Using a user-specified parameter, Eps, find the number of points that have an SNN similarity of Eps or greater to each point. This is the SNN density of the point

SNN Clustering Algorithm …
Find the core points
Using a user specified parameter, MinPts, find the core points, i.e., all points that have an SNN density greater than MinPts
Form clusters from the core points
If two core points are within a “radius”, Eps, of each other they are placed in the same cluster
Discard all noise points
All non-core points that are not within a “radius” of Eps of a core point are discarded
Assign all non-noise, non-core points to clusters
This can be done by assigning such points to the nearest core point
(Note that steps 4-8 are DBSCAN)
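A sketch that strings these steps together. The SNN similarity matrix is built densely for clarity, Eps is a shared-neighbor count, and MinPts is a density threshold, as in the slides; this is illustrative rather than an optimized implementation:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def snn_dbscan(X, k=20, eps=7, min_pts=10):
    """SNN density-based clustering sketch: SNN similarity graph plus DBSCAN-style core/border/noise logic."""
    m = len(X)
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    neighbors = [set(row[1:]) for row in idx]
    # SNN similarity: number of shared nearest neighbors, nonzero only if the points are in each other's k-NN lists
    snn = np.zeros((m, m), dtype=int)
    for i in range(m):
        for j in neighbors[i]:
            if i in neighbors[j]:
                snn[i, j] = snn[j, i] = len(neighbors[i] & neighbors[j])
    density = (snn >= eps).sum(axis=1)        # SNN density of each point
    core = np.where(density >= min_pts)[0]    # core points
    labels = np.full(m, -1)                   # -1 marks noise / unassigned
    # Core points within "radius" Eps of each other share a cluster (simple label propagation)
    cluster = 0
    for c in core:
        if labels[c] == -1:
            labels[c] = cluster
            stack = [c]
            while stack:
                p = stack.pop()
                for q in core:
                    if labels[q] == -1 and snn[p, q] >= eps:
                        labels[q] = cluster
                        stack.append(q)
            cluster += 1
    # Assign non-core points to the cluster of their most similar core point, if similar enough
    for p in range(m):
        if labels[p] == -1 and len(core) > 0:
            best = core[np.argmax(snn[p, core])]
            if snn[p, best] >= eps:
                labels[p] = labels[best]
    return labels  # points still labeled -1 are discarded as noise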

SNN Density

a) All Points b) High SNN Density
c) Medium SNN Density d) Low SNN Density

SNN Clustering Can Handle Differing Densities
Original Points

SNN Clustering

SNN Clustering Can Handle Other Difficult Situations

Finding Clusters of Time Series In Spatio-Temporal Data

SNN Clusters of SLP.
SNN Density of Points on the Globe.

Limitations of SNN Clustering
Does not cluster all the points
Complexity of SNN Clustering is high
O(n × time to find each point's neighbors within Eps)
In the worst case, this is O(n^2)
For lower dimensions, there are more efficient ways to find the nearest neighbors
R* Tree
k-d Trees
Parameterization is not easy

Characteristics of Data, Clusters, and Clustering Algorithms
A cluster analysis is affected by characteristics of
Data
Clusters
Clustering algorithms
Looking at these characteristics gives us a number of dimensions that we can use to describe clustering algorithms and the results that they produce


Characteristics of Data
High dimensionality
Size of data set
Sparsity of attribute values
Noise and Outliers
Types of attributes and type of data sets
Differences in attribute scales
Properties of the data space
Can you define a meaningful centroid?

Characteristics of Clusters
Data distribution
Shape
Differing sizes
Differing densities
Poor separation
Relationship of clusters
Types of clusters
Center-based, contiguity-based, density-based
Subspace clusters

Characteristics of Clustering Algorithms
Order dependence
Non-determinism
Parameter selection
Scalability
Underlying model
Optimization-based approach

Comparison of MIN and EM-Clustering
We assume EM clustering using the Gaussian (normal) distribution.
MIN is hierarchical; EM clustering is partitional.
Both MIN and EM clustering are complete.
MIN has a graph-based (contiguity-based) notion of a cluster, while EM clustering has a prototype-based (or model-based) notion of a cluster.
MIN will not be able to distinguish poorly separated clusters, but EM can manage this in many situations.
MIN can find clusters of different shapes and sizes; EM clustering prefers globular clusters and can have trouble with clusters of different sizes.
MIN has trouble with clusters of different densities, while EM can often handle this.
Neither MIN nor EM clustering finds subspace clusters.

Comparison of MIN and EM-Clustering
MIN can handle outliers, but noise can cause clusters to be joined; EM clustering can tolerate noise, but can be strongly affected by outliers.
EM can only be applied to data for which a centroid is meaningful; MIN only requires a meaningful definition of proximity.
EM will have trouble as dimensionality increases, since the number of its parameters (the number of entries in the covariance matrix) grows as the square of the number of dimensions; MIN can work well with a suitable definition of proximity.
EM is designed for Euclidean data, although versions of EM clustering have been developed for other types of data; MIN is shielded from the data type by the fact that it uses a similarity matrix.
MIN makes no distribution assumptions; the version of EM we are considering assumes Gaussian distributions.

Comparison of MIN and EM-Clustering
EM has an O(n) time complexity; MIN is O(n^2 log n).
Because of random initialization, the clusters found by EM can vary from one run to another; MIN produces the same clusters unless there are ties in the similarity matrix.
Neither MIN nor EM automatically determines the number of clusters.
MIN does not have any user-specified parameters; EM has the number of clusters and possibly the weights of the clusters.
EM clustering can be viewed as an optimization problem; MIN uses a graph model of the data.
Neither EM nor MIN is order dependent.

Comparison of DBSCAN and K-means
Both are partitional.
K-means is complete; DBSCAN is not.
K-means has a prototype-based notion of a cluster; DBSCAN uses a density-based notion.
K-means can find clusters that are not well separated; DBSCAN will merge clusters that touch.
DBSCAN handles clusters of different shapes and sizes; K-means prefers globular clusters.

Comparison of DBSCAN and K-means
DBSCAN can handle noise and outliers; K-means performs poorly in the presence of outliers
K-means can only be applied to data for which a centroid is meaningful; DBSCAN requires a meaningful definition of density
DBSCAN works poorly on high-dimensional data; K-means works well for some types of high-dimensional data
Both techniques were designed for Euclidean data, but have been extended to other types of data
DBSCAN makes no distribution assumptions; K-means implicitly assumes spherical Gaussian distributions

Comparison of DBSCAN and K-means
K-means has an O(n) time complexity; DBSCAN is O(n^2)
Because of random initialization, the clusters found by K-means can vary from one run to another; DBSCAN always produces the same clusters
DBSCAN automatically determines the number of clusters; K-means does not
K-means has only one parameter (the number of clusters); DBSCAN has two (Eps and MinPts)
K-means clustering can be viewed as an optimization problem and as a special case of EM clustering; DBSCAN is not based on a formal model

Formulas referenced in the slides:

Weight constraint for soft clustering: \( \sum_{j=1}^{k} w_{ij} = 1 \)

Estimating weights example (c1 at 1, x at 2, c2 at 5): \( SSE(x) = w_{x1}(2-1)^2 + w_{x2}(5-2)^2 = w_{x1} + 9\,w_{x2} \)

Fuzzy c-means example (p = 2): \( SSE(x) = w_{x1}^{2}(2-1)^2 + w_{x2}^{2}(5-2)^2 = w_{x1}^{2} + 9\,w_{x2}^{2} \)

Centroid update from cluster-membership probabilities: \( \mathbf{c}_j = \sum_{i=1}^{m} \mathbf{x}_i\, p(C_j \mid \mathbf{x}_i) \Big/ \sum_{i=1}^{m} p(C_j \mid \mathbf{x}_i) \)

Estimate of the cluster weights (prior probabilities): \( p(C_j) = \frac{1}{m} \sum_{i=1}^{m} p(C_j \mid \mathbf{x}_i) \)

Figure captions: "26 SLP Clusters via Shared Nearest Neighbor Clustering (100 NN, 1982-1994)" and "SNN Density of SLP Time Series Data" (longitude-latitude maps); the fuzzy and probabilistic clustering plots are shaded by maximum membership / maximum probability on a scale from 0.5 to 0.95.
