Relates to the unsupervised learning lecture.

Q1) Given the following three attributes:

ID              1    2    3    4    5    6    7    8    9    10
Age             23   57   18   32   26   71   66   43   52   18
Blood Pressure  95   150  85   110  100  80   100  105  95   90
BMI             25   20   27   23   18   30   27   25   35   29

Choose any 2 patients and calculate:

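I chose patients 3 and 4. Patient 3 has age 18, blood pressure 85 and BMI 27; patient 4 has age 32, blood pressure 110 and BMI 23.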

a) What is the Euclidean distance between them?

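ed = $\sqrt{(18 - 32)^2 + (85 - 110)^2 + (27 - 23)^2}$

= $\sqrt{196 + 625 + 16}$

= $\sqrt{837} \approx 28.93$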
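
As a quick check of the arithmetic, here is a minimal Python sketch. The function name `euclidean_distance` and the tuple layout (age, blood pressure, BMI) are my own choices for illustration; the values for patients 3 and 4 are read from the table.

```python
import math

# Patients 3 and 4 from the table, as (age, blood pressure, BMI).
patient_3 = (18, 85, 27)
patient_4 = (32, 110, 23)


def euclidean_distance(a, b):
    """Square root of the summed squared differences across all attributes."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


distance = euclidean_distance(patient_3, patient_4)
print(round(distance, 2))  # 28.93
```

The attributes are used on their raw scales here, so blood pressure contributes most of the distance simply because its values differ the most.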

b) What is the Manhattan distance between them?

md = $|18 - 32| + |85 - 110| + |27 - 23| = 14 + 25 + 4 = 43$
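
The Manhattan distance for the same two patients can be checked in the same way. This is only a sketch; `manhattan_distance` is a name I made up for the example.

```python
# Patients 3 and 4 from the table, as (age, blood pressure, BMI).
patient_3 = (18, 85, 27)
patient_4 = (32, 110, 23)


def manhattan_distance(a, b):
    """Sum of the absolute differences across all attributes."""
    return sum(abs(x - y) for x, y in zip(a, b))


print(manhattan_distance(patient_3, patient_4))  # 43
```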


Q2) Briefly describe three different forms of hierarchical clustering methods.

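  • Single linkage: the distance between two clusters is the smallest distance between any pair of points, one taken from each cluster
  • Average linkage: the distance between two clusters is the average distance over all such pairs
  • Complete linkage: the distance between two clusters is the largest distance between any such pair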
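
To make the three linkage rules concrete, here is a small Python sketch. The two clusters are made-up one-dimensional points (not from the patient table), and the function names are hypothetical. It uses `math.dist`, which needs Python 3.8 or later.

```python
import math
from itertools import product


def pairwise_distances(cluster_a, cluster_b):
    """Euclidean distance for every pair with one point from each cluster."""
    return [math.dist(p, q) for p, q in product(cluster_a, cluster_b)]


def single_linkage(cluster_a, cluster_b):
    """Smallest distance between any cross-cluster pair."""
    return min(pairwise_distances(cluster_a, cluster_b))


def average_linkage(cluster_a, cluster_b):
    """Mean distance over all cross-cluster pairs."""
    distances = pairwise_distances(cluster_a, cluster_b)
    return sum(distances) / len(distances)


def complete_linkage(cluster_a, cluster_b):
    """Largest distance between any cross-cluster pair."""
    return max(pairwise_distances(cluster_a, cluster_b))


# Made-up example clusters; the cross-cluster distances are 4, 8, 3 and 7.
cluster_a = [(1,), (2,)]
cluster_b = [(5,), (9,)]

print(single_linkage(cluster_a, cluster_b))    # 3.0
print(average_linkage(cluster_a, cluster_b))   # 5.5
print(complete_linkage(cluster_a, cluster_b))  # 8.0
```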

Q3) Describe two clustering methods and their advantages/disadvantages.

  • Hierarchical: a way of grouping things together based on how similar or different they are from each other. For example, if you have different animals, you start by putting each one in its own group. Then you look at the groups and merge the two most similar ones into a new group. You keep doing this until only a few big groups remain, representing different categories of animals. Because close clusters are merged step by step, the result can be drawn as a dendrogram.

Pros:

  • can produce an ordering of the objects, which could be informative for data display.
  • produces smaller clusters, which can be helpful for discovery.

Cons:

  • can't reallocate an object that has been ‘incorrectly’ grouped at an early stage.
  • different distance metrics might generate different results.
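
The bottom-up merging described for hierarchical clustering can be sketched in a few lines of Python. This is a naive version for illustration only: it assumes single linkage with Euclidean distance, stops at a chosen number of clusters, and uses made-up one-dimensional points. The name `agglomerative` and its parameters are hypothetical.

```python
import math


def single_linkage(cluster_a, cluster_b):
    """Smallest distance between any pair of points, one from each cluster."""
    return min(math.dist(p, q) for p in cluster_a for q in cluster_b)


def agglomerative(points, target_clusters):
    """Start with one cluster per point, then repeatedly merge the closest pair."""
    clusters = [[p] for p in points]
    merges = []  # (cluster, cluster, distance) in the order the merges happen
    while len(clusters) > target_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        # Earlier merges are never revisited, which is the first con listed above.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters, merges


# Made-up example data.
points = [(1,), (2,), (6,), (7,), (15,)]
clusters, merges = agglomerative(points, target_clusters=2)
print(clusters)
for left, right, d in merges:
    print(left, "+", right, "at distance", d)
```

The `merges` list records which clusters were joined and at what distance, which is the information a dendrogram displays.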

  • K-Means: we first decide how many groups we want, for example 3. We then randomly pick 3 “items” to be the centres of the groups. Next, we look at each item, see which centre it is closest to, and put it in the group with that centre. Once every item is in a group, we find a new centre for each group by taking the average of all the items in it. We keep repeating the assignment and averaging steps until the centres no longer change much.

Pros:

  • can be computationally faster than hierarchical clustering if K is small.
  • may produce tighter clusters than hierarchical clustering.

Cons:

  • the number of clusters K has to be fixed in advance, and it can be difficult to figure out a good value.
  • different initial partitions can result in different final clusters.
  • does not work well with non-globular clusters.
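
The K-Means steps described above can be sketched as follows. A few assumptions of mine: the starting centres are passed in explicitly instead of being picked at random, so the run is repeatable; a group that ends up empty keeps its old centre; and the points are made up. The names `k_means` and `mean_point` are hypothetical.

```python
import math


def mean_point(points):
    """Coordinate-wise average of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(points, centres, max_iterations=100, tolerance=1e-9):
    """Alternate between assigning points and re-averaging the centres."""
    groups = []
    for _ in range(max_iterations):
        # Assignment step: each point joins the group of its nearest centre.
        groups = [[] for _ in centres]
        for p in points:
            nearest = min(range(len(centres)),
                          key=lambda i: math.dist(p, centres[i]))
            groups[nearest].append(p)
        # Update step: each centre moves to the average of its group.
        # Assumption: an empty group keeps its previous centre.
        new_centres = [mean_point(g) if g else c
                       for g, c in zip(groups, centres)]
        shift = max(math.dist(a, b) for a, b in zip(centres, new_centres))
        centres = new_centres
        if shift <= tolerance:  # centres have stopped moving
            break
    return centres, groups


# Made-up example data with two obvious groups; K = 2.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
initial_centres = [(1, 1), (8, 8)]
centres, groups = k_means(points, initial_centres)
print(centres)
print(groups)
```

Passing different `initial_centres` is an easy way to see the second con in action, since a poor starting choice can lead to different final groups.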