Confusion Matrix: Evaluation Measures for Classification Problems
created by Angad Gupta

In data mining, classification is the problem of predicting which category or class a new observation belongs to. The derived model (classifier) is based on the analysis of a set of training data in which each record is given a class label. The trained model is then used to predict the class label for new, unseen data. To understand classification metrics, one of the most important concepts is the confusion matrix.

Confusion matrix

A Confusion matrix is an N x N matrix used for evaluating the performance of a classification model, where N is the number of target classes. The matrix compares the actual target values with those predicted by the machine learning model. This gives us a holistic view of how well our classification model is performing and what kinds of errors it is making.

  • The target variable has two values: Positive or Negative
  • The columns represent the actual values of the target variable
  • The rows represent the predicted values of the target variable

Interpretation: Each cell is named with two words: the first is TRUE or FALSE and the second is POSITIVE or NEGATIVE. Here is an easy way to remember them.

For TRUE or FALSE, compare the row label with the column label, writing positive as 1 and negative as 0. If the two agree (1 and 1, or 0 and 0) the cell is True; if they differ it is False. This is equality (XNOR) logic rather than plain AND, since 0 & 0 would give 0. First row and first column: 1 and 1 agree, so True. First row and second column: 1 and 0 differ, so False. Second row and first column: 0 and 1 differ, so False. Second row and second column: 0 and 0 agree, so True.

For the second value, POSITIVE or NEGATIVE, use the row label, that is, the predicted class: cells in the first row are Positive and cells in the second row are Negative.

With these two rules, we can easily remember TP, FP, FN and TN.
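
The naming rule can be sketched as a small function. This is only an illustration: the name cell_label and the use of booleans for the two classes are assumptions, not something from the original figure.

```python
def cell_label(predicted: bool, actual: bool) -> str:
    """Name the confusion-matrix cell for a single prediction.

    First word: "True" if prediction and actual value agree, else "False".
    Second word: taken from the predicted class (the row label).
    """
    first = "True" if predicted == actual else "False"
    second = "Positive" if predicted else "Negative"
    return f"{first} {second}"


# Walk the four cells row by row (rows = predicted, columns = actual).
for predicted in (True, False):
    for actual in (True, False):
        print(predicted, actual, "->", cell_label(predicted, actual))
```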

Understanding True Positive, True Negative, False Positive and False Negative in a Confusion Matrix

True Positive (TP) 

  • The predicted value matches the actual value
  • The actual value was positive and the model predicted a positive value

True Negative (TN) 

  • The predicted value matches the actual value
  • The actual value was negative and the model predicted a negative value

False Positive (FP) – Type 1 error

  • The predicted value does not match the actual value
  • The actual value was negative but the model predicted a positive value
  • Also known as the Type 1 error

False Negative (FN) – Type 2 error

  • The predicted value does not match the actual value
  • The actual value was positive but the model predicted a negative value
  • Also known as the Type 2 error

Example for a better understanding of TP, TN, FP and FN

  • True Positive (TP) = 560; meaning 560 positive class data points were correctly classified by the model
  • True Negative (TN) = 330; meaning 330 negative class data points were correctly classified by the model
  • False Positive (FP) = 60; meaning 60 negative class data points were incorrectly classified as belonging to the positive class by the model
  • False Negative (FN) = 50; meaning 50 positive class data points were incorrectly classified as belonging to the negative class by the model
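
As a quick sanity check, the four counts can be arranged in the layout described earlier (rows = predicted, columns = actual). The sketch below uses plain Python lists; the variable names are chosen here for illustration.

```python
# Counts from the example above.
tp, tn, fp, fn = 560, 330, 60, 50

# Layout used in this article: rows = predicted, columns = actual.
matrix = [
    [tp, fp],  # predicted positive
    [fn, tn],  # predicted negative
]

total = tp + tn + fp + fn
predicted_positive = sum(matrix[0])             # first row sum
predicted_negative = sum(matrix[1])             # second row sum
actual_positive = matrix[0][0] + matrix[1][0]   # first column sum
actual_negative = matrix[0][1] + matrix[1][1]   # second column sum

print("total:", total)
print("predicted positive / negative:", predicted_positive, predicted_negative)
print("actual positive / negative:", actual_positive, actual_negative)
```

Row sums give how many points the model labelled as each class, and column sums give how many points truly belong to each class.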

Mathematical Interpretation of confusion matrix

import numpy as np
import sklearn.datasets
import sklearn.linear_model
import sklearn.metrics
from sklearn.model_selection import train_test_split

# do not change for reproducibility
np.random.seed(42) 

# Importing the dataset
dataset = sklearn.datasets.fetch_covtype()

# only use a random subset for speed - pretend the rest of the data doesn't exist
random_sample = np.random.choice(len(dataset.data), len(dataset.data) // 10)

# We are only interested in the Class 3 forest type.
COVER_TYPE = 3
features = dataset.data[random_sample, :]
target = dataset.target[random_sample] == COVER_TYPE

# Doing the 80-20% train test split of the data
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size = 0.2)

# Building the basic Logistic Regression
classifier = sklearn.linear_model.LogisticRegression(solver='liblinear')
classifier.fit(X_train,  y_train)
predictions = classifier.predict(X_test)

# Printing out Confusion matrix for our predictions
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, predictions))

On execution, you will see the confusion matrix for our predictions.

[[10766   169]
 [  235   451]]
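
Note that scikit-learn arranges the matrix differently from the figure at the start of this article: its rows are the actual classes and its columns are the predicted classes, with the labels in sorted order (False first, then True). A minimal sketch of unpacking the four cells, with the printed matrix copied in as a literal so it runs without downloading the dataset:

```python
# Matrix printed above, copied as a literal.
cm = [[10766, 169],
      [235, 451]]

# scikit-learn layout: rows = actual class, columns = predicted class,
# labels sorted as (False, True), so the negative class comes first.
(tn, fp), (fn, tp) = cm

print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```

With the real array, confusion_matrix(y_test, predictions).ravel() returns the same four values in the order tn, fp, fn, tp.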

1. Accuracy

The accuracy of a classifier is the number of correct predictions divided by the total number of instances, usually expressed as a percentage. Mathematically,

Accuracy = (TP + TN) / (TP + TN + FP + FN)

If the accuracy of the classifier is considered acceptable, the classifier can be used to classify future data tuples for which the class label is not known. But is it the case with us? Let's see if accuracy is the right evaluation metric for this problem.

Accuracy is reliable when the classes are roughly balanced (close to a 50-50 split of true and false class labels) and can be misleading when the data set is imbalanced. For imbalanced data mining problems, accuracy alone is a poor choice of metric because it says little about how well the minority class is predicted.
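
To make this concrete, here is a sketch that computes accuracy from the confusion matrix printed earlier and compares it with a hypothetical baseline that always predicts the negative class. The baseline is an illustration added here, not a model from the original experiment.

```python
# Cells of the confusion matrix printed earlier.
tn, fp, fn, tp = 10766, 169, 235, 451
total = tn + fp + fn + tp

# Accuracy = (TP + TN) / (TP + TN + FP + FN)
accuracy = (tp + tn) / total

# Hypothetical baseline: always predict "negative".
# It is correct on every actual negative (TN + FP) and wrong on every actual positive.
baseline_accuracy = (tn + fp) / total

# Share of test instances that are actually positive.
positive_share = (tp + fn) / total

print(f"model accuracy:    {accuracy:.2%}")
print(f"baseline accuracy: {baseline_accuracy:.2%}")
print(f"positive share:    {positive_share:.2%}")
```

The model scores about 96.5%, but the do-nothing baseline already reaches about 94.1% because only about 6% of the instances are positive, so accuracy hides most of the difference between them.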

2. Recall

Recall is one of the most used evaluation metrics for an unbalanced dataset. It calculates how many of the actual positives our model predicted as positives (True Positive).

Recall is also known as true positive rate (TPR), sensitivity, or probability of detection.

Mathematically,

Recall = TP / (TP + FN)

In the confusion matrix:

Recall = 451/(451+235) = 65.74%
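
The same calculation as a small helper, taking recall as TP / (TP + FN). Returning 0.0 when there are no actual positives is a convention assumed here for the sketch.

```python
def recall(tp: int, fn: int) -> float:
    """Recall = TP / (TP + FN); 0.0 if there are no actual positives (assumed convention)."""
    actual_positives = tp + fn
    return tp / actual_positives if actual_positives else 0.0


print(f"{recall(451, 235):.2%}")
```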

3. Precision

Precision describes how precise the model's positive predictions are: out of the cases predicted positive, how many are actually positive?

Precision is also called a measure of exactness or quality, or positive predictive value.

Mathematically,

Precision = TP / (TP + FP)

In the confusion matrix:

Precision = 451/(451+169) = 72.74%
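
A matching sketch for precision, taken as TP / (TP + FP). As before, returning 0.0 when the model predicts no positives at all is an assumed convention.

```python
def precision(tp: int, fp: int) -> float:
    """Precision = TP / (TP + FP); 0.0 if nothing was predicted positive (assumed convention)."""
    predicted_positives = tp + fp
    return tp / predicted_positives if predicted_positives else 0.0


print(f"{precision(451, 169):.2%}")
```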

4. F1 Score

When both recall and precision matter, the F1 score comes into the picture: it balances the two. The F1 score ignores true negatives, so on an imbalanced data set it is usually more informative than accuracy.

Mathematically, it is defined as a harmonic mean of recall and precision:

F1 Score = 2 x Precision x Recall / (Precision + Recall)

F1 Score = 2 x 72.74 x 65.74 / (72.74 + 65.74) ≈ 69.07% (computed from the unrounded precision and recall)

The F score reaches the best value, meaning perfect precision and recall, at a value of 1. The worst F score, which means the lowest precision and lowest recall, would be a value of 0. 
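
A short sketch of the F1 calculation from the raw counts. It computes the harmonic mean of precision and recall and also the algebraically equivalent form 2TP / (2TP + FP + FN), which makes it visible that true negatives play no part.

```python
# Cells of the confusion matrix printed earlier (TN is deliberately not needed).
tp, fp, fn = 451, 169, 235

precision = tp / (tp + fp)
recall = tp / (tp + fn)

# Harmonic mean of precision and recall.
f1_harmonic = 2 * precision * recall / (precision + recall)

# Equivalent form written directly in terms of the counts.
f1_counts = 2 * tp / (2 * tp + fp + fn)

print(f"{f1_harmonic:.2%}", f"{f1_counts:.2%}")
```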

5. ROC Curve

Sometimes it's not easy to decide which evaluation metric to use, and visualizing performance across different classification thresholds can help us choose.

Receiver Operating Characteristic curves, or ROC curves, are graphs that show the performance of a classification model at all classification thresholds. An ROC curve is a useful visual tool for comparing two classification models. It depicts the trade-off between the true positive rate (TPR) and the false positive rate (FPR) of a classification model.

Mathematically,

TPR = TP / (TP + FN)
FPR = FP / (FP + TN)

When we lower the threshold of a classifier, it classifies more items as positive, thus increasing both false positives and true positives.
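
The effect of lowering the threshold can be sketched with a tiny made-up example. The scores and labels below are hypothetical, chosen only to illustrate the idea; they do not come from the forest cover data.

```python
# Hypothetical predicted probabilities of the positive class, with true labels.
scores = [0.95, 0.85, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [True, True, False, True, True, False, False, False]


def rates_at(threshold: float) -> tuple[float, float]:
    """Return (TPR, FPR) when every score >= threshold is predicted positive."""
    predicted = [s >= threshold for s in scores]
    tp = sum(p and y for p, y in zip(predicted, labels))
    fp = sum(p and not y for p, y in zip(predicted, labels))
    fn = sum((not p) and y for p, y in zip(predicted, labels))
    tn = sum((not p) and (not y) for p, y in zip(predicted, labels))
    return tp / (tp + fn), fp / (fp + tn)


# Sweep from a strict threshold to a lenient one; each pair is one ROC point.
thresholds = (0.9, 0.65, 0.35, 0.05)
roc_points = [rates_at(t) for t in thresholds]

for t, (tpr, fpr) in zip(thresholds, roc_points):
    print(f"threshold={t:.2f}  TPR={tpr:.2f}  FPR={fpr:.2f}")
```

As the threshold drops, neither rate ever decreases: more items are predicted positive, so both true positives and false positives can only grow. Plotting FPR on the x-axis against TPR on the y-axis for many thresholds gives the ROC curve.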

ROC is one of the most popular plots, which helps in the interpretation of a classifier.

6. Specificity

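Specificity, also called the true negative rate (TNR), measures how many of the actual negatives the model predicted as negative:

Specificity = TN / (TN + FP)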
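
As a sketch, specificity (taken here as TN / (TN + FP)) can be computed from the confusion matrix printed earlier. It is the complement of the false positive rate used on the ROC curve.

```python
# Cells of the confusion matrix printed earlier.
tn, fp = 10766, 169

specificity = tn / (tn + fp)          # true negative rate
false_positive_rate = fp / (fp + tn)  # FPR = 1 - specificity

print(f"specificity: {specificity:.2%}")
print(f"FPR:         {false_positive_rate:.2%}")
```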

7. Summary

  • Precision is how certain you are of your true positives. Recall is how certain you are that you are not missing any positives.
  • Choose Recall if false negatives are unacceptable. For example, when screening for diabetes you would rather raise some extra false positives (false alarms) than miss actual cases (false negatives).
  • Choose Precision if you want to be more confident of your predicted positives. For example, with spam filtering you would rather have some spam emails in your inbox than some regular emails in your spam box, so you want to be extra sure that email X is spam before putting it in the spam box.
  • Choose Specificity if you want to cover all true negatives, meaning you do not want any false alarms (false positives). For example, in the case of a drug test in which all people who test positive will immediately go to jail, you would not want anyone drug-free going to jail.
