Introduction

  • Classification is the problem of assigning class labels to data.
  • It involves allocating a class to new, unlabelled cases based on existing labelled data.

Example:

Case   Age   Salary   Class
1      50    40       1
2      32    20       0
3      36    45       0
4      55    55       1
5      61    50       0
6      29    30       0
7      48    35       0
8      65    45       1
9      23    40       0
10     51    25       1

New data:

Case   Age   Salary   Class
11     60    45       ?
12     40    30       ?
13     47    40       ?
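
A minimal sketch (not the lecture's worked example) of how a classifier might label the new cases, assuming a 1-nearest-neighbour rule with Euclidean distance over the raw Age and Salary values; k-nearest neighbour itself is covered in the linked notes below.

```python
# Illustrative only: label each new case with the class of its nearest
# training case (1-NN, Euclidean distance, no feature scaling).
import math

# (Age, Salary, Class) rows from the example table
training = [
    (50, 40, 1), (32, 20, 0), (36, 45, 0), (55, 55, 1), (61, 50, 0),
    (29, 30, 0), (48, 35, 0), (65, 45, 1), (23, 40, 0), (51, 25, 1),
]

new_cases = [(60, 45), (40, 30), (47, 40)]  # cases 11-13, class unknown

for age, salary in new_cases:
    nearest = min(training, key=lambda t: math.dist((age, salary), t[:2]))
    print(f"Age={age}, Salary={salary} -> predicted class {nearest[2]}")
```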


See: cs3002-supervised-learning-decision-trees

See: cs3002-supervised-learning-knearest-neighbour


Testing Performance

  • The aim is to classify new, unseen data.
  • We often use a simple error rate to assess a classifier (a small sketch follows this list).
  • error rate = number of errors / number of cases
  • The empirical error rate is not the same as the true error rate.
    • The empirical error rate is measured on a finite sample.
    • The true error rate is what we would observe over infinitely many cases.
  • Is it possible to estimate the true error rate?
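
A minimal sketch of the empirical error rate from the formula above; the function name and label lists are illustrative, not from the lecture.

```python
# Empirical error rate: fraction of cases the classifier gets wrong.
def error_rate(true_labels, predicted_labels):
    errors = sum(t != p for t, p in zip(true_labels, predicted_labels))
    return errors / len(true_labels)

# 2 errors out of 5 cases -> empirical error rate 0.4
print(error_rate([1, 0, 0, 1, 1], [1, 0, 1, 1, 0]))
```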

Training and Test Sets

  • A good first step is to split the data into a training set and a test set.
  • This is known as the holdout method (sketched below).
  • Use the training set to learn the model.
  • Use the test set to measure its accuracy.
  • Ideally the two sets are independent (generated separately).
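
A minimal sketch of the holdout method, assuming scikit-learn is available; the 70/30 split and the k-nearest-neighbour classifier are illustrative choices, not prescribed by the lecture.

```python
# Holdout method: learn on the training set, score accuracy on the test set.
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Features (Age, Salary) and class labels from the example table
X = [[50, 40], [32, 20], [36, 45], [55, 55], [61, 50],
     [29, 30], [48, 35], [65, 45], [23, 40], [51, 25]]
y = [1, 0, 0, 1, 0, 0, 0, 1, 0, 1]

# Hold out 30% of the cases as an independent test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)                            # learn the model
print("test accuracy:", model.score(X_test, y_test))   # score on unseen data
```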

Resampling

  • What if the sample of data is small or biased?
  • Resampling methods can be used.
    • Randomly select training and test sets.
    • Repeat for a fixed number of iterations.
  • Methods include:
    • Cross-validation.
    • Bootstrapping.

Cross-validation

  • k-fold cross-validation (sketched below):
    • Randomly split the data into k subsets (folds) of equal size.
    • Remove one fold and train the classifier on the remaining folds.
    • Test on the removed fold.
    • Repeat so that every fold is used once as the test set.
  • Cross-validation is considered an unbiased estimator of the true error rate.
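
A minimal sketch of k-fold cross-validation, assuming scikit-learn is available; the choice of k = 5 and of a k-nearest-neighbour classifier is illustrative.

```python
# 5-fold cross-validation: each fold is held out once for testing.
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

X = [[50, 40], [32, 20], [36, 45], [55, 55], [61, 50],
     [29, 30], [48, 35], [65, 45], [23, 40], [51, 25]]
y = [1, 0, 0, 1, 0, 0, 0, 1, 0, 1]

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = KNeighborsClassifier(n_neighbors=3)
    model.fit([X[i] for i in train_idx], [y[i] for i in train_idx])
    scores.append(model.score([X[i] for i in test_idx], [y[i] for i in test_idx]))

print("mean accuracy over folds:", sum(scores) / len(scores))
```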

Bootstrapping

  • For the bootstrap, training items are sampled with replacement from the cases.
  • Cases that do not appear in the training set are used as the test set.
  • It generally produces estimates that are higher than the true error rate (a worst-case scenario); see the sketch below.
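
A minimal sketch of drawing one bootstrap sample using only the standard library; the seed and case count are illustrative.

```python
# Bootstrap: sample n cases with replacement for training; cases never drawn
# ("out-of-bag" cases) form the test set.
import random

def bootstrap_split(n_cases, seed=0):
    rng = random.Random(seed)
    train_idx = [rng.randrange(n_cases) for _ in range(n_cases)]  # with replacement
    test_idx = [i for i in range(n_cases) if i not in set(train_idx)]
    return train_idx, test_idx

train_idx, test_idx = bootstrap_split(10)
print("training cases:", sorted(train_idx))
print("test (out-of-bag) cases:", test_idx)
```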

Resampling and Random Forests

See: cs3002-supervised-learning-confusion-matrix

Bias, Variance and Overfitting

  • Consider two models fitted to the same data: a straight line and a highly flexible curve (a small sketch follows this list).

  • Bias is a systematic error in the model.
  • Variance is how much the model changes from one sample of data to the next.
  • The first model (the straight line) has high bias.
  • The second (the flexible curve) has high variance:
    • It fits the training data very well.
    • But it will not predict new cases well.
    • It has overfit to the data.
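
A minimal sketch of the two models using NumPy (an assumption; the lecture's figure is not reproduced here): a straight-line fit as the high-bias model and a degree-9 polynomial as the high-variance, overfit model.

```python
# Fit a straight line (high bias) and a degree-9 polynomial (high variance)
# to the same 10 noisy points, then predict an unseen input.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)  # noisy data

line = np.polyfit(x, y, deg=1)     # underfits: systematic (bias) error
wiggly = np.polyfit(x, y, deg=9)   # passes through every point: overfits

x_new = 0.05  # an unseen input between the training points
print("straight-line prediction:", np.polyval(line, x_new))
print("degree-9 prediction:     ", np.polyval(wiggly, x_new))
```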

Summary

  • Decision Trees
    • Decision trees use a tree-like structure to make decisions and classify data.
    • They are easy to interpret but can be prone to overfitting, although pruning can help mitigate this.
  • K-Nearest Neighbour
    • KNN is based on the proximity of data points in the feature space.
    • It’s easy to interpret, but it does not model data explicitly.
  • Testing Performance
    • Classifier performance is assessed on unseen data, for example with a simple error rate on a held-out test set.
  • Resampling
    • Resampling techniques like cross-validation and bootstrapping are employed to address issues like small or biased datasets.
  • Confusion Matrix, Sensitivity and Specificity
    • The confusion matrix is useful for considering the importance of errors.
    • Sensitivity and specificity are common measures for evaluating classifier performance.
  • Precision and Recall
    • Precision and recall are particularly relevant for imbalanced data, where one class is much larger than the other.
  • ROC Curves vs. PR Curves
    • ROC curves illustrate the tradeoff between sensitivity and specificity.
    • PR curves illustrate the tradeoff between precision and recall.
  • Bias, Variance, and Overfitting
    • Models can exhibit bias (systematic errors) or high variance (overfitting).
    • Overfit models may perform well on training data but poorly on new data.