1. The following attributes are measured for members of a herd of Asian elephants: weight, height, tusk length, trunk length, and ear area. Based on these measurements, what sort of similarity measure (from the measures of similarity and dissimilarity covered in the 1st PDF) would you use to compare or group these elephants? Justify your answer and explain any special circumstances.
2. Consider the training examples shown in the 2nd PDF (Table 4.7) for a binary classification problem.
(a) Compute the Gini index for the overall collection of training examples.
(b) Compute the Gini index for the Customer ID attribute.
(c) Compute the Gini index for the Gender attribute.
(d) Compute the Gini index for the Car Type attribute using multiway split.
3. Consider the data set shown in the 3rd PDF (Table 5.10).
(a) Estimate the conditional probabilities for P(A|+), P(B|+), P(C|+), P(A|-), P(B|-), and P(C|-).
(b) Use the estimate of conditional probabilities given in the previous question to predict the class label for a test sample (A = 0, B = 1, C = 0) using the naive Bayes approach.
(c) Estimate the conditional probabilities using the m-estimate approach, with p = 1/2 and m = 4.
APA format. Due in 16 hours. 3 references.
Table 4.7. Data set for Exercise 2.

Customer ID   Gender   Car Type   Shirt Size    Class
1             M        Family     Small         C0
2             M        Sports     Medium        C0
3             M        Sports     Medium        C0
4             M        Sports     Large         C0
5             M        Sports     Extra Large   C0
6             M        Sports     Extra Large   C0
7             F        Sports     Small         C0
8             F        Sports     Small         C0
9             F        Sports     Medium        C0
10            F        Luxury     Large         C0
11            M        Family     Large         C1
12            M        Family     Extra Large   C1
13            M        Family     Medium        C1
14            M        Luxury     Extra Large   C1
15            F        Luxury     Small         C1
16            F        Luxury     Small         C1
17            F        Luxury     Medium        C1
18            F        Luxury     Medium        C1
19            F        Luxury     Medium        C1
20            F        Luxury     Large         C1
4.8 Exercises

Table 4.8. Data set for Exercise 3.

Instance   a1   a2   a3    Target Class
1          T    T    1.0   +
2          T    T    6.0   +
3          T    F    5.0   -
4          F    F    4.0   +
5          F    T    7.0   -
6          F    T    3.0   -
7          F    F    8.0   -
8          T    F    7.0   +
9          F    T    5.0   -

(b) What are the information gains of a1 and a2 relative to these training examples?
(c) For a3, which is a continuous attribute, compute the information gain for every possible split.
Table 5.10. Data set for Exercise 7.

Record   A   B   C   Class
1        0   0   0   +
2        0   0   1   -
3        0   1   1   -
4        0   1   1   -
5        0   0   1   +
6        1   0   1   +
7        1   0   1   -
8        1   0   1   -
9        1   1   1   +
10       1   0   1   +
(b) Given the information in part (a), is a randomly chosen college student more likely to be a graduate or undergraduate student?
(c) Repeat part (b) assuming that the student is a smoker.
(d) Suppose 30% of the graduate students live in a dorm but only 10% of the undergraduate students live in a dorm. If a student smokes and lives in the dorm, is he or she more likely to be a graduate or undergraduate student? You can assume independence between students who live in a dorm and those who smoke.
7. Consider the data set shown in Table 5.10.
(a) Estimate the conditional probabilities for P(A|+), P(B|+), P(C|+), P(A|-), P(B|-), and P(C|-).
(b) Use the estimate of conditional probabilities given in the previous question to predict the class label for a test sample (A = 0, B = 1, C = 0) using the naive Bayes approach.
(c) Estimate the conditional probabilities using the m-estimate approach, with p = 1/2 and m = 4.
(d) Repeat part (b) using the conditional probabilities given in part (c).
(e) Compare the two methods for estimating probabilities. Which method is better and why?
8. Consider the data set shown in Table 5.11.
(a) Estimate the conditional probabilities for P(A = 1|+), P(B = 1|+), P(C = 1|+), P(A = 1|-), P(B = 1|-), and P(C = 1|-) using the same approach as in the previous problem.
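To make the naive Bayes calculations in Exercise 7 concrete, here is a minimal Python sketch using the records of Table 5.10; it supports both the simple frequency estimate n_c/n and the m-estimate (n_c + mp)/(n + m). The function names are illustrative assumptions, not code from the textbook.

```python
# Records from Table 5.10: (A, B, C, Class).
data = [
    (0, 0, 0, '+'), (0, 0, 1, '-'), (0, 1, 1, '-'), (0, 1, 1, '-'),
    (0, 0, 1, '+'), (1, 0, 1, '+'), (1, 0, 1, '-'), (1, 0, 1, '-'),
    (1, 1, 1, '+'), (1, 0, 1, '+'),
]

def cond_prob(attr, value, label, m=0, p=0.5):
    """P(attribute = value | class = label); m = 0 gives the plain
    frequency estimate, m > 0 gives the m-estimate (n_c + m*p)/(n + m)."""
    rows = [r for r in data if r[3] == label]
    n_c = sum(1 for r in rows if r[attr] == value)
    return (n_c + m * p) / (len(rows) + m)

def naive_bayes(sample, m=0, p=0.5):
    """Pick the label maximizing P(label) * prod_k P(attr_k | label)."""
    scores = {}
    for label in ('+', '-'):
        score = sum(1 for r in data if r[3] == label) / len(data)
        for k, value in enumerate(sample):
            score *= cond_prob(k, value, label, m, p)
        scores[label] = score
    return max(scores, key=scores.get), scores

# Test sample (A = 0, B = 1, C = 0), with and without the m-estimate (m = 4, p = 1/2).
print(naive_bayes((0, 1, 0)))
print(naive_bayes((0, 1, 0), m=4, p=0.5))
```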
the comparison between people will be dominated by differences in income. In
particular, if the similarity or dissimilarity of two people is calculated using the
similarity or dissimilarity measures defined later in this chapter, then in many
cases, such as that of Euclidean distance, the income values will dominate the
calculation.
The mean and standard deviation are strongly affected by outliers, so the
above transformation is often modified. First, the mean is replaced by the
median, i.e., the middle value. Second, the standard deviation is replaced by
the absolute standard deviation. Specifically, if x is a variable, then the
absolute standard deviation of x is given by \sigma_{AAD} = \sum_{i=1}^{m} |x_i - \mu|, where x_i is
the ith value of the variable, m is the number of objects, and \mu is either the
mean or median. Other approaches for computing estimates of the location
(center) and spread of a set of values in the presence of outliers are described
in Sections 3.2.3 and 3.2.4, respectively. These measures can also be used to
define a standardization transformation.
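As a small illustration, the following Python sketch applies the outlier-resistant standardization just described, using the median and the absolute standard deviation (the function name is an assumption, not from the text):

```python
def robust_standardize(values):
    """Standardize with the median and the absolute standard deviation,
    both of which are less affected by outliers than the mean and
    standard deviation."""
    ordered = sorted(values)
    n = len(ordered)
    median = (ordered[n // 2] if n % 2 == 1
              else (ordered[n // 2 - 1] + ordered[n // 2]) / 2)
    # Absolute standard deviation: sum of |x_i - mu| with mu = median.
    sigma_aad = sum(abs(x - median) for x in values)
    return [(x - median) / sigma_aad for x in values]

# The outlier 100 barely shifts the median-based center.
print(robust_standardize([1, 2, 3, 4, 100]))
```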
2.4 Measures of Similarity and Dissimilarity
Similarity and dissimilarity are important because they are used by a number
of data mining techniques, such as clustering, nearest neighbor classification,
and anomaly detection. In many cases, the initial data set is not needed once
these similarities or dissimilarities have been computed. Such approaches can
be viewed as transforming the data to a similarity (dissimilarity) space and
then performing the analysis.
We begin with a discussion of the basics: high-level definitions of similarity
and dissimilarity, and a discussion of how they are related. For convenience,
the term proximity is used to refer to either similarity or dissimilarity. Since
the proximity between two objects is a function of the proximity between the
corresponding attributes of the two objects, we first describe how to measure
the proximity between objects having only one simple attribute, and then
consider proximity measures for objects with multiple attributes. This in-
cludes measures such as correlation and Euclidean distance, which are useful
for dense data such as time series or two-dimensional points, as well as the
Jaccard and cosine similarity measures, which are useful for sparse data like
documents. Next, we consider several important issues concerning proximity
measures. The section concludes with a brief discussion of how to select the
right proximity measure.
2.4.1 Basics
Definitions
Informally, the similarity between two objects is a numerical measure of the
degree to which the two objects are alike. Consequently, similarities are higher
for pairs of objects that are more alike. Similarities are usually non-negative
and are often between 0 (no similarity) and 1 (complete similarity).
The dissimilarity between two objects is a numerical measure of the de-
gree to which the two objects are different. Dissimilarities are lower for more
similar pairs of objects. Frequently, the term distance is used as a synonym
for dissimilarity, although, as we shall see, distance is often used to refer to
a special class of dissimilarities. Dissimilarities sometimes fall in the interval
[0,1], but it is also common for them to range from 0 to ∞.
Transformations
Transformations are often applied to convert a similarity to a dissimilarity,
or vice versa, or to transform a proximity measure to fall within a particular
range, such as [0,1]. For instance, we may have similarities that range from 1
to 10, but the particular algorithm or software package that we want to use
may be designed to only work with dissimilarities, or it may only work with
similarities in the interval [0,1]. We discuss these issues here because we will
employ such transformations later in our discussion of proximity. In addi-
tion, these issues are relatively independent of the details of specific proximity
measures.
Frequently, proximity measures, especially similarities, are defined or trans-
formed to have values in the interval [0,1]. Informally, the motivation for this
is to use a scale in which a proximity value indicates the fraction of similarity
(or dissimilarity) between two objects. Such a transformation is often rela-
tively straightforward. For example, if the similarities between objects range
from 1 (not at all similar) to 10 (completely similar), we can make them fall
within the range [0, 1] by using the transformation s′ = (s − 1)/9, where s and
s′ are the original and new similarity values, respectively. In the more general
case, the transformation of similarities to the interval [0, 1] is given by the
expression s′ = (s − min_s)/(max_s − min_s), where max_s and min_s are the
maximum and minimum similarity values, respectively. Likewise, dissimilarity
measures with a finite range can be mapped to the interval [0, 1] by using the
formula d′ = (d − min_d)/(max_d − min_d).
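A short sketch of these range transformations (a minimal illustration; the function names are assumptions):

```python
def rescale_similarity(s, min_s, max_s):
    """Map a similarity with known range [min_s, max_s] onto [0, 1]."""
    return (s - min_s) / (max_s - min_s)

def rescale_dissimilarity(d, min_d, max_d):
    """Map a dissimilarity with a finite range [min_d, max_d] onto [0, 1]."""
    return (d - min_d) / (max_d - min_d)

# Similarities on a 1-to-10 scale become (s - 1)/9 on [0, 1].
print(rescale_similarity(1, 1, 10), rescale_similarity(10, 1, 10))  # 0.0 1.0
```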
There can be various complications in mapping proximity measures to the
interval [0, 1], however. If, for example, the proximity measure originally takes
values in the interval [0, ∞], then a non-linear transformation is needed and
values will not have the same relationship to one another on the new scale.
Consider the transformation d′ = d/(1 + d) for a dissimilarity measure that
ranges from 0 to ∞. The dissimilarities 0, 0.5, 2, 10, 100, and 1000 will be
transformed into the new dissimilarities 0, 0.33, 0.67, 0.90, 0.99, and 0.999,
respectively. Larger values on the original dissimilarity scale are compressed
into the range of values near 1, but whether or not this is desirable depends on
the application. Another complication is that the meaning of the proximity
measure may be changed. For example, correlation, which is discussed later,
is a measure of similarity that takes values in the interval [-1,1]. Mapping
these values to the interval [0,1] by taking the absolute value loses information
about the sign, which can be important in some applications. See Exercise 22
on page 94.
Transforming similarities to dissimilarities and vice versa is also relatively
straightforward, although we again face the issues of preserving meaning and
changing a linear scale into a non-linear scale. If the similarity (or dissimilar-
ity) falls in the interval [0,1], then the dissimilarity can be defined as d = 1 − s
(s = 1 − d). Another simple approach is to define similarity as the nega-
tive of the dissimilarity (or vice versa). To illustrate, the dissimilarities 0, 1,
10, and 100 can be transformed into the similarities 0, -1, -10, and -100,
respectively.
The similarities resulting from the negation transformation are not re-
stricted to the range [0, 1], but if that is desired, then transformations such as
s = 1/(d + 1), s = e^{−d}, or s = 1 − (d − min_d)/(max_d − min_d) can be used.
For the transformation s = 1/(d + 1), the dissimilarities 0, 1, 10, 100 are
transformed into 1, 0.5, 0.09, 0.01, respectively. For s = e^{−d}, they become
1.00, 0.37, 0.00, 0.00, respectively, while for s = 1 − (d − min_d)/(max_d − min_d),
they become 1.00, 0.99, 0.90, 0.00, respectively.
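The three dissimilarity-to-similarity conversions just mentioned, as a minimal Python sketch (names are illustrative):

```python
import math

def sim_reciprocal(d):
    return 1.0 / (d + 1.0)

def sim_exponential(d):
    return math.exp(-d)

def sim_minmax(d, min_d, max_d):
    return 1.0 - (d - min_d) / (max_d - min_d)

ds = [0, 1, 10, 100]
print([round(sim_reciprocal(d), 2) for d in ds])       # [1.0, 0.5, 0.09, 0.01]
print([round(sim_exponential(d), 2) for d in ds])      # [1.0, 0.37, 0.0, 0.0]
print([round(sim_minmax(d, 0, 100), 2) for d in ds])   # [1.0, 0.99, 0.9, 0.0]
```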
In this discussion, we have focused on converting dissimilarities to similarities.
Conversion in the opposite direction is considered in Exercise 23 on page 94.
In general, any monotonic decreasing function can be used to convert dis-
similarities to similarities, or vice versa. Of course, other factors also must
be considered when transforming similarities to dissimilarities, or vice versa,
or when transforming the values of a proximity measure to a new scale. We
have mentioned issues related to preserving meaning, distortion of scale, and
requirements of data analysis tools, but this list is certainly not exhaustive.
2.4.2 Similarity and Dissimilarity between Simple Attributes
The proximity of objects with a number of attributes is typically defined by
combining the proximities of individual attributes, and thus, we first discuss
proximity between objects having a single attribute. Consider objects de-
scribed by one nominal attribute. What would it mean for two such objects
to be similar? Since nominal attributes only convey information about the
distinctness of objects, all we can say is that two objects either have the same
value or they do not. Hence, in this case similarity is traditionally defined as 1
if attribute values match, and as 0 otherwise. A dissimilarity would be defined
in the opposite way: 0 if the attribute values match, and 1 if they do not.
For objects with a single ordinal attribute, the situation is more compli-
cated because information about order should be taken into account. Consider
an attribute that measures the quality of a product, e.g., a candy bar, on the
scale {poor, fair, OK, good, wonderful}. It would seem reasonable that a prod-
uct, P1, which is rated wonderful, would be closer to a product P2, which is
rated good, than it would be to a product P3, which is rated OK. To make this
observation quantitative, the values of the ordinal attribute are often mapped
to successive integers, beginning at 0 or 1, e.g., {poor=0, fair=1, OK=2,
good=3, wonderful=4}. Then, d(P1, P2) = 3 − 2 = 1 or, if we want the dis-
similarity to fall between 0 and 1, d(P1, P2) = (3 − 2)/4 = 0.25. A similarity for
ordinal attributes can then be defined as s = 1 − d.
This definition of similarity (dissimilarity) for an ordinal attribute should
make the reader a bit uneasy since this assumes equal intervals, which is not
so; otherwise, we would have an interval or ratio attribute. Is the difference
between the values fair and good really the same as that between the values
OK and wonderful? Probably not, but in practice, our options are limited,
and in the absence of more information, this is the standard approach for
defining proximity between ordinal attributes.
For interval or ratio attributes, the natural measure of dissimilarity be-
tween two objects is the absolute difference of their values. For example, we
might compare our current weight and our weight a year ago by saying “I am
ten pounds heavier.” In cases such as these, the dissimilarities typically range
from 0 to ∞, rather than from 0 to 1. The similarity of interval or ratio at-
tributes is typically expressed by transforming a dissimilarity into a similarity,
as previously described.
Table 2.7 summarizes this discussion. In this table, x and y are two objects
that have one attribute of the indicated type. Also, d(x, y) and s(x, y) are the
dissimilarity and similarity between x and y, respectively. Other approaches
are possible; these are the most common ones.
The following two sections consider more complicated measures of prox-
imity between objects that involve multiple attributes: (1) dissimilarities be-
tween data objects and (2) similarities between data objects. This division
allows us to more naturally display the underlying motivations for employing
various proximity measures. We emphasize, however, that similarities can be
transformed into dissimilarities and vice versa using the approaches described
earlier.
2.4.3 Dissimilarities between Data Objects
In this section, we discuss various kinds of dissimilarities. We begin with a
discussion of distances, which are dissimilarities with certain properties, and
then provide examples of more general kinds of dissimilarities.
Distances
We first present some examples, and then offer a more formal description of
distances in terms of the properties common to all distances. The Euclidean
distance, d, between two points, x and y, in one-, two-, three-, or higher-
dimensional space, is given by the following familiar formula:
d(x, y) = \sqrt{ \sum_{k=1}^{n} (x_k - y_k)^2 },    (2.1)

where n is the number of dimensions and x_k and y_k are, respectively, the kth
attributes (components) of x and y. We illustrate this formula with Figure
2.15 and Tables 2.8 and 2.9, which show a set of points, the x and y coordinates
of these points, and the distance matrix containing the pairwise distances
of these points.
Table 2.7. Similarity and dissimilarity for simple attributes.

Attribute Type      Dissimilarity                          Similarity
Nominal             d = 0 if x = y, d = 1 if x ≠ y         s = 1 if x = y, s = 0 if x ≠ y
Ordinal             d = |x − y|/(n − 1)                    s = 1 − d
                    (values mapped to integers 0 to n − 1,
                    where n is the number of values)
Interval or Ratio   d = |x − y|                            s = −d, s = 1/(1 + d), s = e^{−d},
                                                           s = 1 − (d − min_d)/(max_d − min_d)
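The per-attribute proximities of Table 2.7 are simple to compute; the sketch below is one way to write them (function names are assumptions, not from the text):

```python
def nominal_similarity(x, y):
    """Nominal attribute: similarity is 1 on a match, 0 otherwise."""
    return 1.0 if x == y else 0.0

def ordinal_dissimilarity(x, y, n_values):
    """Ordinal attribute whose values are mapped to integers 0 .. n_values - 1."""
    return abs(x - y) / (n_values - 1)

def interval_dissimilarity(x, y):
    """Interval or ratio attribute: absolute difference."""
    return abs(x - y)

# Candy-bar quality on {poor=0, fair=1, OK=2, good=3, wonderful=4}:
d = ordinal_dissimilarity(3, 2, 5)   # d(P1, P2) = 1/4 = 0.25
print(d, 1 - d)                      # similarity s = 1 - d = 0.75
```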
The Euclidean distance measure given in Equation 2.1 is generalized by
the Minkowski distance metric shown in Equation 2.2,

d(x, y) = \left( \sum_{k=1}^{n} |x_k - y_k|^r \right)^{1/r},    (2.2)
where r is a parameter. The following are the three most common examples
of Minkowski distances.
• r = 1. City block (Manhattan, taxicab, L1 norm) distance. A common
  example is the Hamming distance, which is the number of bits that
  are different between two objects that have only binary attributes, i.e.,
  between two binary vectors.

• r = 2. Euclidean distance (L2 norm).

• r = ∞. Supremum (L_max or L_∞ norm) distance. This is the maximum
  difference between any attribute of the objects. More formally, the L_∞
  distance is defined by Equation 2.3:

d(x, y) = \lim_{r \to \infty} \left( \sum_{k=1}^{n} |x_k - y_k|^r \right)^{1/r}.    (2.3)
The r parameter should not be confused with the number of dimensions (at-
tributes) n. The Euclidean, Manhattan, and supremum distances are defined
for all values of n = 1, 2, 3, …, and specify different ways of combining the
differences in each dimension (attribute) into an overall distance.
Tables 2.10 and 2.11, respectively, give the proximity matrices for the L1
and L∞ distances using data from Table 2.8. Notice that all these distance
matrices are symmetric; i.e., the ijth entry is the same as the jith entry. In
Table 2.9, for instance, the fourth row of the first column and the fourth
column of the first row both contain the value 5.1.
Distances, such as the Euclidean distance, have some well-known proper-
ties. If d(x, y) is the distance between two points, x and y, then the following
properties hold.

1. Positivity

   (a) d(x, y) ≥ 0 for all x and y,
   (b) d(x, y) = 0 only if x = y.
2. Symmetry

   d(x, y) = d(y, x) for all x and y.

3. Triangle Inequality

   d(x, z) ≤ d(x, y) + d(y, z) for all points x, y, and z.

Measures that satisfy all three properties are known as metrics. Some
people only use the term distance for dissimilarity measures that satisfy these
properties, but that practice is often violated. The three properties described
here are useful, as well as mathematically pleasing. Also, if the triangle in-
equality holds, then this property can be used to increase the efficiency of tech-
niques (including clustering) that depend on distances possessing this property.
(See Exercise 25.) Nonetheless, many dissimilarities do not satisfy one or more
of the metric properties. We give two examples of such measures.

[Figure 2.15. Four two-dimensional points.]

Table 2.8. x and y coordinates of four points.

point   x coordinate   y coordinate
p1      0              2
p2      2              0
p3      3              1
p4      5              1

Table 2.9. Euclidean distance matrix for Table 2.8.

      p1    p2    p3    p4
p1    0.0   2.8   3.2   5.1
p2    2.8   0.0   1.4   3.2
p3    3.2   1.4   0.0   2.0
p4    5.1   3.2   2.0   0.0

Table 2.10. L1 distance matrix for Table 2.8.

L1    p1    p2    p3    p4
p1    0.0   4.0   4.0   6.0
p2    4.0   0.0   2.0   4.0
p3    4.0   2.0   0.0   2.0
p4    6.0   4.0   2.0   0.0

Table 2.11. L∞ distance matrix for Table 2.8.

L∞    p1    p2    p3    p4
p1    0.0   2.0   3.0   5.0
p2    2.0   0.0   1.0   3.0
p3    3.0   1.0   0.0   2.0
p4    5.0   3.0   2.0   0.0
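The following sketch reproduces the three distance matrices above from the points of Table 2.8 (a minimal illustration, not code from the book):

```python
def minkowski(x, y, r):
    """Minkowski distance (Equation 2.2); r = float('inf') gives the supremum."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if r == float('inf'):
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1.0 / r)

points = {'p1': (0, 2), 'p2': (2, 0), 'p3': (3, 1), 'p4': (5, 1)}

for r, name in [(2, 'Euclidean'), (1, 'L1'), (float('inf'), 'Linf')]:
    print(name)
    for a in points:  # rows match Tables 2.9, 2.10, and 2.11
        print(a, [round(minkowski(points[a], points[b], r), 1) for b in points])
```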
Example 2.14 (Non-metric Dissimilarities: Set Differences). This ex-
ample is based on the notion of the difference of two sets, as defined in set
theory. Given two sets A and B, A − B is the set of elements of A that are
not in B. For example, if A = {1, 2, 3, 4} and B = {2, 3, 4}, then A − B = {1}
and B − A = ∅, the empty set. We can define the distance d between two
sets A and B as d(A, B) = size(A − B), where size is a function returning
the number of elements in a set. This distance measure, which is an integer
value greater than or equal to 0, does not satisfy the second part of the pos-
itivity property, the symmetry property, or the triangle inequality. However,
these properties can be made to hold if the dissimilarity measure is modified
as follows: d(A, B) = size(A − B) + size(B − A). See Exercise 21 on page 94. ∎
Example 2.15 (Non-metric Dissimilarities: Time). This example gives
a more everyday example of a dissimilarity measure that is not a metric, but
that is still useful. Define a measure of the distance between times of the day
as follows:

d(t_1, t_2) = \begin{cases} t_2 - t_1 & \text{if } t_1 \le t_2 \\ 24 + (t_2 - t_1) & \text{if } t_1 > t_2 \end{cases}    (2.4)

To illustrate, d(1PM, 2PM) = 1 hour, while d(2PM, 1PM) = 23 hours.
Such a definition would make sense, for example, when answering the question:
"If an event occurs at 1PM every day, and it is now 2PM, how long do I have
to wait for that event to occur again?"
2.4.4 Similarities between Data Objects
For similarities, the triangle inequality (or the analogous property) typically
does not hold, but symmetry and positivity typically do. To be explicit, if
s(x, y) is the similarity between points x and y, then the typical properties of
similarities are the following:
1. s(x, y) = 1 only if x = y. (0 ≤ s ≤ 1)

2. s(x, y) = s(y, x) for all x and y. (Symmetry)
There is no general analog of the triangle inequality for similarity mea-
sures. It is sometimes possible, however, to show that a similarity measure
can easily be converted to a metric distance. The cosine and Jaccard similarity
measures, which are discussed shortly, are two examples. Also, for specific sim-
ilarity measures, it is possible to derive mathematical bounds on the similarity
between two objects that are similar in spirit to the triangle inequality.
Example 2.16 (A Non-symmetric Similarity Measure). Consider an
experiment in which people are asked to classify a small set of characters as
they flash on a screen. The confusion matrix for this experiment records how
often each character is classified as itself, and how often each is classified as
another character. For instance, suppose that "0" appeared 200 times and was
classified as a "0" 160 times, but as an "o" 40 times. Likewise, suppose that
"o" appeared 200 times and was classified as an "o" 170 times, but as "0" only
30 times. If we take these counts as a measure of the similarity between two
characters, then we have a similarity measure, but one that is not symmetric.
In such situations, the similarity measure is often made symmetric by setting
s′(x, y) = s′(y, x) = (s(x, y) + s(y, x))/2, where s′ indicates the new similarity
measure. ∎
2.4.5 Examples of Proximity Measures
This section provides specific examples of some similarity and dissimilarity
measures.
Similarity Measures for Binary Data
Similarity measures between objects that contain only binary attributes are
called similarity coefficients, and typically have values between 0 and 1. A
value of 1 indicates that the two objects are completely similar, while a value
of 0 indicates that the objects are not at all similar. There are many rationales
for why one coefficient is better than another in specific instances.
Let x and y be two objects that consist of n binary attributes. The com-
parison of two such objects, i.e., two binary vectors, leads to the following four
quantities (frequencies):

f_00 = the number of attributes where x is 0 and y is 0
f_01 = the number of attributes where x is 0 and y is 1
f_10 = the number of attributes where x is 1 and y is 0
f_11 = the number of attributes where x is 1 and y is 1
Simple Matching Coefficient  One commonly used similarity coefficient is
the simple matching coefficient (SMC), which is defined as

SMC = \frac{\text{number of matching attribute values}}{\text{number of attributes}} = \frac{f_{11} + f_{00}}{f_{01} + f_{10} + f_{11} + f_{00}}.    (2.5)
This measure counts both presences and absences equally. Consequently, the
SMC could be used to find students who had answered questions similarly on
a test that consisted only of true/false questions.
Jaccard Coefficient Suppose that x and y are data objects that represent
two rows (two transactions) of a transaction matrix (see Section 2.1.2). If each
asymmetric binary attribute corresponds to an item in a store, then a 1 indi-
cates that the item was purchased, while a 0 indicates that the product was not
purchased. Since the number of products not purchased by any customer far
outnumbers the number of products that were purchased, a similarity measure
such as SMC would say that all transactions are very similar. As a result, the
Jaccard coefficient is frequently used to handle objects consisting of asymmet-
ric binary attributes. The Jaccard coefficient, which is often symbolized by
J, is given by the following equation:

J = \frac{\text{number of matching presences}}{\text{number of attributes not involved in 00 matches}} = \frac{f_{11}}{f_{01} + f_{10} + f_{11}}.    (2.6)
Example 2.17 (The SMC and Jaccard Similarity Coefficients). To
illustrate the difference between these two similarity measures, we calculate
SMC and J for the following two binary vectors.

x = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
y = (0, 0, 0, 0, 0, 0, 1, 0, 0, 1)

f_01 = 2   (the number of attributes where x was 0 and y was 1)
f_10 = 1   (the number of attributes where x was 1 and y was 0)
f_00 = 7   (the number of attributes where x was 0 and y was 0)
f_11 = 0   (the number of attributes where x was 1 and y was 1)

SMC = (f_11 + f_00)/(f_01 + f_10 + f_11 + f_00) = (0 + 7)/(2 + 1 + 0 + 7) = 0.7

J = f_11/(f_01 + f_10 + f_11) = 0/(2 + 1 + 0) = 0  ∎
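Example 2.17 can be checked with a few lines of Python (a minimal sketch; the helper names are assumptions):

```python
def binary_counts(x, y):
    """Return (f00, f01, f10, f11) for two binary vectors."""
    f00 = sum(1 for a, b in zip(x, y) if (a, b) == (0, 0))
    f01 = sum(1 for a, b in zip(x, y) if (a, b) == (0, 1))
    f10 = sum(1 for a, b in zip(x, y) if (a, b) == (1, 0))
    f11 = sum(1 for a, b in zip(x, y) if (a, b) == (1, 1))
    return f00, f01, f10, f11

def smc(x, y):
    f00, f01, f10, f11 = binary_counts(x, y)
    return (f11 + f00) / (f01 + f10 + f11 + f00)

def jaccard(x, y):
    f00, f01, f10, f11 = binary_counts(x, y)
    return f11 / (f01 + f10 + f11)

x = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
y = (0, 0, 0, 0, 0, 0, 1, 0, 0, 1)
print(smc(x, y), jaccard(x, y))  # 0.7 0.0
```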
Cosine Similarity
Documents are often represented as vectors, where each attribute represents
the frequency with which a particular term (word) occurs in the document. It
is more complicated than this, of course, since certain common words are ig-
nored and various processing techniques are used to account for different forms
of the same word, differing document lengths, and different word frequencies.
Even though documents have thousands or tens of thousands of attributes
(terms), each document is sparse since it has relatively few non-zero attributes.
(The normalizations used for documents do not create a non-zero entry where
there was azero entry; i.e., they preserve sparsity.) Thus, as with transaction
data, similarity should not depend on the number of shared 0 values since
any two documents are likely to “not contain” many of the same words, and
therefore, if 0-0 matches are counted, most documents will be highly similar to
most other documents. Therefore, a similarity measure for documents needs
to ignore 0-0 matches like the Jaccard measure, but also must be able to
handle non-binary vectors. The cosine similarity, defined next, is one of the
most common measures of document similarity. If x and y are two document
vectors, then
\cos(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|},    (2.7)

where · indicates the vector dot product, x \cdot y = \sum_{k=1}^{n} x_k y_k, and ||x|| is the
length of vector x, \|x\| = \sqrt{ \sum_{k=1}^{n} x_k^2 } = \sqrt{x \cdot x}.
Example 2.18 (Cosine Similarity of Two Document Vectors). This
example calculates the cosine similarity for the following two data objects,
which might represent document vectors:

x = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
y = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)

x · y = 3×1 + 2×0 + 0×0 + 5×0 + 0×0 + 0×0 + 0×0 + 2×1 + 0×0 + 0×2 = 5
||x|| = sqrt(3×3 + 2×2 + 0×0 + 5×5 + 0×0 + 0×0 + 0×0 + 2×2 + 0×0 + 0×0) = 6.48
||y|| = sqrt(1×1 + 0×0 + 0×0 + 0×0 + 0×0 + 0×0 + 0×0 + 1×1 + 0×0 + 2×2) = 2.45
cos(x, y) ≈ 0.31  ∎
As indicated by Figure 2.16, cosine similarity really is a measure of the
(cosine of the) angle between x and y. Thus, if the cosine similarity is 1, the
angle between x and y is 0°, and x and y are the same except for magnitude
(length). If the cosine similarity is 0, then the angle between x and y is 90°,
and they do not share any terms (words).
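Example 2.18 as a minimal Python sketch:

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two vectors (Equation 2.7)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

x = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
y = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)
print(round(cosine_similarity(x, y), 2))  # 0.31
```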
[Figure 2.16. Geometric illustration of the cosine measure.]
Equation 2.7 can be written as Equation 2.8.
\cos(x, y) = \frac{x}{\|x\|} \cdot \frac{y}{\|y\|} = x' \cdot y',    (2.8)

where x′ = x/||x|| and y′ = y/||y||. Dividing x and y by their lengths normal-
izes them to have a length of 1. This means that cosine similarity does not take
the magnitude of the two data objects into account when computing similarity.
(Euclidean distance might be a better choice when magnitude is important.)
For vectors with a length of 1, the cosine measure can be calculated by taking
a simple dot product. Consequently, when many cosine similarities between
objects are being computed, normalizing the objects to have unit length can
reduce the time required.
Extended Jaccard Coefficient (Tanimoto Coefficient)

The extended Jaccard coefficient can be used for document data and reduces
to the Jaccard coefficient in the case of binary attributes. The extended
Jaccard coefficient is also known as the Tanimoto coefficient. (However, there
is another coefficient that is also known as the Tanimoto coefficient.) This co-
efficient, which we shall represent as EJ, is defined by the following equation:

EJ(x, y) = \frac{x \cdot y}{\|x\|^2 + \|y\|^2 - x \cdot y}.    (2.9)
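Equation 2.9 in code form (a minimal sketch):

```python
def extended_jaccard(x, y):
    """Extended Jaccard (Tanimoto) coefficient, Equation 2.9."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x_sq = sum(a * a for a in x)
    norm_y_sq = sum(b * b for b in y)
    return dot / (norm_x_sq + norm_y_sq - dot)

# For binary vectors, EJ reduces to the Jaccard coefficient:
x = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
y = (0, 0, 0, 0, 0, 0, 1, 0, 0, 1)
print(extended_jaccard(x, y))  # 0.0, matching J in Example 2.17
```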
Correlation
The correlation between two data objects that have binary or continuous vari-
ables is a measure of the linear relationship between the attributes of the
objects. (The calculation of correlation between attributes, which is more
common, can be defined similarly.) More precisely, Pearson’s correlation
coefficient between two data objects, x and y, is defined by the following
equation:

corr(x, y) = \frac{\text{covariance}(x, y)}{\text{standard\_deviation}(x) \times \text{standard\_deviation}(y)} = \frac{s_{xy}}{s_x s_y},    (2.10)

where we are using the following standard statistical notation and definitions:

covariance(x, y) = s_{xy} = \frac{1}{n-1} \sum_{k=1}^{n} (x_k - \bar{x})(y_k - \bar{y})    (2.11)

standard_deviation(x) = s_x = \sqrt{ \frac{1}{n-1} \sum_{k=1}^{n} (x_k - \bar{x})^2 }

standard_deviation(y) = s_y = \sqrt{ \frac{1}{n-1} \sum_{k=1}^{n} (y_k - \bar{y})^2 }

\bar{x} = \frac{1}{n} \sum_{k=1}^{n} x_k is the mean of x

\bar{y} = \frac{1}{n} \sum_{k=1}^{n} y_k is the mean of y
Example 2.19 (Perfect Correlation). Correlation is always in the range
−1 to 1. A correlation of 1 (−1) means that x and y have a perfect positive
(negative) linear relationship; that is, x_k = a y_k + b, where a and b are con-
stants. The following two sets of values for x and y indicate cases where the
correlation is −1 and +1, respectively. In the first case, the means of x and y
were chosen to be 0, for simplicity.

x = (−3, 6, 0, 3, −6)
y = (1, −2, 0, −1, 2)

x = (3, 6, 0, 3, 6)
y = (1, 2, 0, 1, 2)  ∎
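Equations 2.10 and 2.11 in code, checked against Examples 2.19 and 2.20 (a minimal sketch):

```python
import math

def correlation(x, y):
    """Pearson's correlation coefficient (Equation 2.10)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    s_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    s_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    return cov / (s_x * s_y)

print(correlation((-3, 6, 0, 3, -6), (1, -2, 0, -1, 2)))             # -1.0
print(correlation((3, 6, 0, 3, 6), (1, 2, 0, 1, 2)))                 #  1.0
print(correlation((-3, -2, -1, 0, 1, 2, 3), (9, 4, 1, 0, 1, 4, 9)))  #  0.0
```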
[Figure 2.17. Scatter plots illustrating correlations from −1 to 1.]
Example 2.20 (Non-linear Relationships). If the correlation is 0, then
there is no linear relationship between the attributes of the two data objects.
However, non-linear relationships may still exist. In the following example,
y_k = x_k², but their correlation is 0.

x = (−3, −2, −1, 0, 1, 2, 3)
y = (9, 4, 1, 0, 1, 4, 9)  ∎
Example 2.21 (Visualizing Correlation). It is also easy to judge the cor-
relation between two data objects x and y by plotting pairs of corresponding
attribute values. Figure 2.17 shows a number of these plots when x and y
have 30 attributes and the values of these attributes are randomly generated
(with a normal distribution) so that the correlation of x and y ranges from −1
to 1. Each circle in a plot represents one of the 30 attributes; its x coordinate
is the value of one of the attributes for x, while its y coordinate is the value
of the same attribute for y. ∎
If we transform x and y by subtracting off their means and then normaliz-
ing them so that their lengths are 1, then their correlation can be calculated by
taking the dot product. Notice that this is not the same as the standardization
used in other contexts, where we make the transformations x′_k = (x_k − \bar{x})/s_x
and y′_k = (y_k − \bar{y})/s_y.
Bregman Divergence* This section provides a brief description of Breg-
man divergences, which are a family of proximity functions that share some
common properties. As a result, it is possible to construct general data min-
ing algorithms, such as clustering algorithms, that work with any Bregman
divergence. A concrete example is the K-means clustering algorithm (Section
8.2). Note that this section requires knowledge of vector calculus.
Bregman divergences are loss or distortion functions. To understand the
idea of a loss function, consider the following. Let x and y be two points, where
y is regarded as the original point and x is some distortion or approximation
of it. For example, x may be a point that was generated, for example, by
adding random noise to y. The goal is to measure the resulting distortion or
loss that results if y is approximated by x. Of course, the more similar x and
y are, the smaller the loss or distortion. Thus, Bregman divergences can be
used as dissimilarity functions.
More formally, we have the following definition.
Definition 2.6 (Bregman Divergence). Given a strictly convex function
φ (with a few modest restrictions that are generally satisfied), the Bregman
divergence (loss function) D(x, y) generated by that function is given by the
following equation:

D(x, y) = φ(x) − φ(y) − ⟨∇φ(y), (x − y)⟩    (2.12)

where ∇φ(y) is the gradient of φ evaluated at y, x − y is the vector difference
between x and y, and ⟨∇φ(y), (x − y)⟩ is the inner product between ∇φ(y)
and (x − y). For points in Euclidean space, the inner product is just the dot
product.
D(x, y) can be written as D(x, y) = φ(x) − L(x), where L(x) = φ(y) +
⟨∇φ(y), (x − y)⟩ and represents the equation of a plane that is tangent to the
function φ at y. Using calculus terminology, L(x) is the linearization of φ
around the point y, and the Bregman divergence is just the difference between
a function and a linear approximation to that function. Different Bregman
divergences are obtained by using different choices for φ.
Example 2.22. We provide a concrete example using squared Euclidean dis-
tance, but restrict ourselves to one dimension to simplify the mathematics. Let
x and y be real numbers and φ(t) be the real-valued function φ(t) = t². In
that case, the gradient reduces to the derivative and the dot product reduces
to multiplication. Specifically, Equation 2.12 becomes Equation 2.13:

D(x, y) = x² − y² − 2y(x − y) = (x − y)²    (2.13)

The graph for this example, with y = 1, is shown in Figure 2.18. The
Bregman divergence is shown for two values of x: x = 2 and x = 3. ∎
[Figure 2.18. Illustration of Bregman divergence.]
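A one-dimensional sketch of Definition 2.6, specialized to φ(t) = t² as in Example 2.22:

```python
def bregman_divergence(phi, grad_phi, x, y):
    """One-dimensional Bregman divergence (Equation 2.12); the inner
    product reduces to ordinary multiplication."""
    return phi(x) - phi(y) - grad_phi(y) * (x - y)

phi = lambda t: t * t        # squared Euclidean case
grad_phi = lambda t: 2 * t   # derivative of t^2

# With y = 1, D(x, y) = (x - y)^2, as in Equation 2.13.
print(bregman_divergence(phi, grad_phi, 2, 1))  # 1
print(bregman_divergence(phi, grad_phi, 3, 1))  # 4
```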
2.4.6 Issues in Proximity Calculation
This section discusses several important issues related to proximity measures:
(1) how to handle the case in which attributes have different scales and/or are
correlated, (2) how to calculate proximity between objects that are composed
of different types of attributes, e.g., quantitative and qualitative, and (3) how
to handle proximity calculation when attributes have different weights; i.e.,
when not all attributes contribute equally to the proximity of objects.
Standardization and Correlation for Distance Measures
An important issue with distance measures is how to handle the situation
when attributes do not have the same range of values. (This situation is
often described by saying that “the variables have different scales.”) Earlier,
Euclidean distance was used to measure the distance between people based on
two attributes: age and income. Unless these two attributes are standardized,
the distance between two people will be dominated by income.
A related issue is how to compute distance when there is correlation be-
tween some of the attributes, perhaps in addition to differences in the ranges of
values. A generalization of Euclidean distance, the Mahalanobis distance,
is useful when attributes are correlated, have different ranges of values (dif-
ferent variances), and the distribution of the data is approximately Gaussian
(normal). Specifically, the Mahalanobis distance between two objects (vectors)
x and y is defined as
mahalanobis(x, y) = (x − y) \Sigma^{-1} (x − y)^T,    (2.14)

where \Sigma^{-1} is the inverse of the covariance matrix of the data. Note that the
covariance matrix \Sigma is the matrix whose ijth entry is the covariance of the ith
and jth attributes as defined by Equation 2.11.
Example 2.23. In Figure 2.19, there are 1000 points, whose x and y at-
tributes have a correlation of 0.6. The distance between the two large points
at the opposite ends of the long axis of the ellipse is 14.7 in terms of Euclidean
distance, but only 6 with respect to Mahalanobis distance. In practice, com-
puting the Mahalanobis distance is expensive, but can be worthwhile for data
whose attributes are correlated. If the attributes are relatively uncorrelated,
but have different ranges, then standardizing the variables is sufficient. ∎
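Equation 2.14 can be sketched with NumPy; the data below is synthetic, generated to have roughly the correlation of Figure 2.19 (an illustration, not the book's data):

```python
import numpy as np

def mahalanobis(x, y, data):
    """Mahalanobis distance of Equation 2.14, with the covariance matrix
    estimated from the data set."""
    cov_inv = np.linalg.inv(np.cov(data, rowvar=False))
    diff = np.asarray(x) - np.asarray(y)
    return float(diff @ cov_inv @ diff)

rng = np.random.default_rng(0)
data = rng.multivariate_normal([0, 0], [[1.0, 0.6], [0.6, 1.0]], size=1000)

# Along the long axis of the ellipse the Mahalanobis distance (about 20 here)
# is much smaller than the squared Euclidean distance of 32.
print(mahalanobis([-2, -2], [2, 2], data))
```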
Combining Similarities for Heterogeneous Attributes
The previous definitions of similarity were based on approaches that assumed
all the attributes were of the same type. A general approach is needed when the
attributes are of different types. One straightforward approach is to compute
the similarity between each attribute separately using Table 2.7, and then
combine these similarities using a method that results in a similarity between
0 and 1. Typically, the overall similarity is defined as the average of all the
individual attribute similarities.
Figure 2.19. Set of two-dimensional points. The Mahalanobis distance between the two points repre-
sented by large dots is 6; their Euclidean distance is 14.7.
Unfortunately, this approach does not work well if some of the attributes
are asymmetric attributes. For example, if all the attributes are asymmetric
binary attributes, then the similarity measure suggested previously reduces to
the simple matching coefficient, a measure that is not appropriate for asym-
metric binary attributes. The easiest way to fix this problem is to omit asym-
metric attributes from the similarity calculation when their values are 0 for
both of the objects whose similarity is being computed. A similar approach
also works well for handling missing values.
In summary, Algorithm 2.1 is effective for computing an overall similar-
ity between two objects, x and y, with different types of attributes. This
procedure can be easily modified to work with dissimilarities.
Algorithm 2.1 Similarities of heterogeneous objects.

1: For the kth attribute, compute a similarity, s_k(x, y), in the range [0, 1].
2: Define an indicator variable, δ_k, for the kth attribute as follows:
   δ_k = 0 if the kth attribute is an asymmetric attribute and both objects have
         a value of 0, or if one of the objects has a missing value for the kth attribute;
   δ_k = 1 otherwise.
3: Compute the overall similarity between the two objects using the following formula:

   similarity(x, y) = \frac{\sum_{k=1}^{n} \delta_k s_k(x, y)}{\sum_{k=1}^{n} \delta_k}    (2.15)

Using Weights

In much of the previous discussion, all attributes were treated equally when
computing proximity. This is not desirable when some attributes are more im-
portant to the definition of proximity than others. To address these situations,
the formulas for proximity can be modified by weighting the contribution of
each attribute. If the weights w_k sum to 1, then (2.15) becomes

   similarity(x, y) = \frac{\sum_{k=1}^{n} w_k \delta_k s_k(x, y)}{\sum_{k=1}^{n} \delta_k}    (2.16)

The definition of the Minkowski distance can also be modified as follows:

   d(x, y) = \left( \sum_{k=1}^{n} w_k |x_k - y_k|^r \right)^{1/r}    (2.17)
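Algorithm 2.1, with the optional weights of Equation 2.16, as a minimal Python sketch (the parameter names and example data are assumptions):

```python
def heterogeneous_similarity(x, y, attribute_sims, asymmetric, weights=None):
    """Overall similarity of two objects with mixed attribute types
    (Algorithm 2.1; weights give Equation 2.16)."""
    w = weights or [1.0] * len(attribute_sims)
    num = den = 0.0
    for k, s_k in enumerate(attribute_sims):
        # delta_k = 0 for an asymmetric attribute when both values are 0,
        # or when either value is missing (None); delta_k = 1 otherwise.
        if x[k] is None or y[k] is None:
            continue
        if asymmetric[k] and x[k] == 0 and y[k] == 0:
            continue
        num += w[k] * s_k(x[k], y[k])
        den += 1.0
    return num / den if den else 0.0

# One nominal attribute and two asymmetric binary attributes (illustrative data).
sims = [lambda a, b: 1.0 if a == b else 0.0] * 3
print(heterogeneous_similarity(('red', 1, 0), ('red', 1, 0),
                               sims, [False, True, True]))
# 1.0 -- the shared 0 on the last attribute is ignored rather than
# counted as a match.
```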
2.4.7 Selecting the Right Proximity Measure
The following are a few general observations that may be helpful. First, the
type of proximity measure should fit the type of data. For many types of dense,
continuous data, metric distance measures such as Euclidean distance are of-
ten used. Proximity between continuous attributes is most often expressed
in terms of differences, and distance measures provide a well-defined way of
combining these differences into an overall proximity measure. Although at-
tributes can have different scales and be of differing importance, these issues
can often be dealt with as described earlier.
For sparse data, which often consists of asymmetric attributes, we typi-
cally employ similarity measures that ignore 0-0 matches. Conceptually, this
reflects the fact that, for a pair of complex objects, similarity depends on the
number of characteristics they both share, rather than the number of charac-
teristics they both lack. More specifically, for sparse, asymmetric data, most
objects have only a few of the characteristics described by the attributes, and
thus, are highly similar in terms of the characteristics they do not have. The
cosine, Jaccard, and extended Jaccard measures are appropriate for such data.
There are other characteristics of data vectors that may need to be consid-
ered. Suppose, for example, that we are interested in comparing time series.
If the magnitude of the time series is important (for example, each time series
represents the total sales of the same organization for a different year), then we
could use Euclidean distance. If the time series represent different quantities
(for example, blood pressure and oxygen consumption), then we usually want
to determine if the time series have the same shape, not the same magnitude.
Correlation, which uses a built-in normalization that accounts for differences
in magnitude and level, would be more appropriate.
In some cases, transformation or normalization of the data is important
for obtaining a proper similarity measure since such transformations are not
always present in proximity measures. For instance, time series may have
trends or periodic patterns that significantly impact similarity. Also, a proper
computation of similarity may require that time lags be taken into account.
Finally, two time series may only be similar over specific periods of time. For
example, there is a strong relationship between temperature and the use of
natural gas, but only during the heating season.
Practical considerations can also be important. Sometimes, one or more
proximity measures are already in use in a particular field, and thus, others
will have answered the question of which proximity measures should be used.
Other times, the software package or clustering algorithm being used may
drastically limit the choices. If efficiency is a concern, then we may want to
choose a proximity measure that has a property, such as the triangle inequality,
that can be used to reduce the number of proximity calculations. (See Exercise
25.)
However, if common practice or practical restrictions do not dictate a
choice, then the proper choice of a proximity measure can be a time-consuming
task that requires careful consideration of both domain knowledge and the
purpose for which the measure is being used. A number of different similarity
measures may need to be evaluated to see which ones produce results that
make the most sense.
2.5 Bibliographic Notes
It is essential to understand the nature of the data that is being analyzed,
and at a fundamental level, this is the subject of measurement theory.