
  • Classification attempts to solve the problem of assigning classes to data.
  • It involves allocating a class to new, unassigned cases based on existing data.



Testing Performance

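  • The aim is to classify new, unseen data correctly.
  • A simple error rate is often used to assess a classifier.
  • error rate = number of misclassified cases / total number of cases
  • The empirical error rate is not the same as the true error rate.
    • The empirical rate is measured on a finite sample.
    • The true rate is the expected error over the whole population of possible cases (in effect, infinitely many).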
  • Is it possible to estimate the true error rate?
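
The error rate formula above can be sketched in a few lines of Python. The function name error_rate and the labels are hypothetical, chosen only for illustration; the result is an empirical error rate, since it is computed on a small sample.

```python
def error_rate(predicted, actual):
    """Fraction of cases whose predicted class differs from the actual class."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("need at least one case")
    errors = sum(1 for p, a in zip(predicted, actual) if p != a)
    return errors / len(actual)


# Hypothetical labels, for illustration only.
actual = ["spam", "ham", "ham", "spam", "ham"]
predicted = ["spam", "ham", "spam", "spam", "ham"]

# One misclassified case out of five.
rate = error_rate(predicted, actual)
```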

Training and Test Sets

  • Split the data into a training set and a test set, so that the classifier is scored on cases it has not seen.
  • This is known as the holdout method.
  • Use the training set to learn the model.
  • Use the test set to score its accuracy.
  • Ideally the two sets are independent (generated separately).
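
A minimal sketch of the holdout method, assuming a random 70/30 split and a trivial majority-class model. The split fraction, the helper holdout_split, the model and the data are all assumptions made for illustration; the slides do not prescribe them.

```python
import random
from collections import Counter


def holdout_split(cases, test_fraction=0.3, seed=0):
    """Shuffle the cases and hold out a fraction of them as the test set."""
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(round(len(shuffled) * test_fraction)))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical (feature, class) cases.
cases = [(i, "a" if i < 6 else "b") for i in range(10)]
train, test = holdout_split(cases, test_fraction=0.3, seed=42)

# Learn on the training set only: here, simply the most common class.
majority = Counter(label for _, label in train).most_common(1)[0][0]

# Score on the held-out test set only.
accuracy = sum(1 for _, label in test if label == majority) / len(test)
```

The test cases play no part in choosing the model, which is what makes the test score a fair estimate of performance on unseen data.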


  • What if the sample of data is small or biased?
  • Resampling methods can be used.
    • Randomly select training and test sets.
    • Repeat for a fixed number of iterations.
  • Methods include:
    • Cross-validation.
    • Bootstrapping.


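  • K-fold cross-validation:
    • Randomly split the data into K subsets (folds) of equal size.
    • Remove one of the subsets and train the classifier on the remaining subsets.
    • Test on the removed subset.
    • Repeat for all subsets and average the error over the K tests.
  • Cross-validation is considered an approximately unbiased estimator of the true error (slightly pessimistic, since each model is trained on only part of the data).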
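
The K-fold steps above can be sketched as follows. The helpers k_fold_indices and cross_validated_error, the majority-class model and the labels are hypothetical and serve only as an illustration; any classifier with a fit step and a predict step could be plugged in.

```python
import random
from collections import Counter


def k_fold_indices(n_cases, k, seed=0):
    """Yield (train_indices, test_indices) for each of k random folds."""
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)
    # Fold sizes differ by at most one when k does not divide n_cases.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def cross_validated_error(cases, labels, k, fit, predict, seed=0):
    """Total misclassifications over all held-out folds, divided by case count."""
    errors = 0
    for train_idx, test_idx in k_fold_indices(len(cases), k, seed):
        model = fit([cases[i] for i in train_idx], [labels[i] for i in train_idx])
        errors += sum(1 for i in test_idx if predict(model, cases[i]) != labels[i])
    return errors / len(cases)


# Hypothetical model: always predict the most common training class.
def fit_majority(xs, ys):
    return Counter(ys).most_common(1)[0][0]


def predict_majority(model, x):
    return model


cases = list(range(10))
labels = ["a"] * 8 + ["b"] * 2
cv_error = cross_validated_error(cases, labels, 5, fit_majority, predict_majority)
```

Every case is used for testing exactly once and for training K-1 times, so no data is wasted even when the sample is small.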


  • For the bootstrap, training items are sampled with replacement from the available cases.
  • Cases that are never drawn into the training set are used as the test set.
  • It generally produces a pessimistic estimate: error rates worse than the true error rate (a worst-case scenario).
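
A minimal sketch of one bootstrap split, assuming the training sample has the same size as the original data set (a common choice, but an assumption here, as is the helper name bootstrap_split). Under that assumption roughly 37% of the cases are never drawn and so end up in the test set.

```python
import random


def bootstrap_split(n_cases, seed=0):
    """Draw n_cases training indices with replacement; the rest form the test set."""
    rng = random.Random(seed)
    train_idx = [rng.randrange(n_cases) for _ in range(n_cases)]
    chosen = set(train_idx)
    # Cases that were never drawn are left out of training entirely.
    test_idx = [i for i in range(n_cases) if i not in chosen]
    return train_idx, test_idx


train_idx, test_idx = bootstrap_split(1000, seed=1)

# Fraction of cases left out of the training sample.
left_out_fraction = len(test_idx) / 1000
```

Because the training sample contains duplicates, it holds fewer distinct cases than the full data set, which is one reason the resulting error estimate tends to be pessimistic.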

Resampling and Random Forests


Bias, Variance and Overfitting

  • Consider two models:

  • Bias is a systematic error in the model.
  • Variance is how much the fitted model changes from one training sample to the next.
  • The first model (straight line) has high bias.
  • The second has high variance:
    • It fits the data very well.
    • But it will not predict new cases well.
    • It has overfit to the data.
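
The contrast between the two models can be sketched on synthetic data. Everything here is an assumption made for illustration: the data (a noisy quadratic), the use of squared error rather than class error, and the choice of a nearest-neighbour lookup as the high-variance model that memorises the training data.

```python
import random


def fit_line(xs, ys):
    """Least-squares straight line: a simple, high-bias model."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_nearest(xs, ys):
    """Return the y of the closest training x: a high-variance model."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def mse(model, xs, ys):
    """Mean squared error of a model on the given cases."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def sample(rng, n):
    """Synthetic cases: y = x^2 plus Gaussian noise."""
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    ys = [x * x + rng.gauss(0, 0.3) for x in xs]
    return xs, ys


rng = random.Random(0)
train_x, train_y = sample(rng, 30)
test_x, test_y = sample(rng, 200)

line = fit_line(train_x, train_y)
nearest = fit_nearest(train_x, train_y)

line_train, line_test = mse(line, train_x, train_y), mse(line, test_x, test_y)
nn_train, nn_test = mse(nearest, train_x, train_y), mse(nearest, test_x, test_y)
```

The nearest-neighbour model reproduces the training data exactly, so its training error is zero, yet its error on new cases is not: that gap is the signature of overfitting. The straight line cannot follow the curve however much data it sees, which is its bias.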


