This is the fourth video tutorial on support vector machines (SVMs) with scikit-learn on the cancer dataset. In the last video, we trained a support vector classifier (SVC) with an RBF kernel and all default parameters on our dataset.
We noticed how it overfits the training data (scoring 100% on the training set) and how poorly it performs on the test subset. Several factors could explain the decreased performance of the algorithm, including the scale of the data and untuned hyperparameters.
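If you want to reproduce that setup, a minimal sketch looks like the one below. Note that `random_state=0` is an assumption (the video may use a different split), and the exact scores depend on your scikit-learn version, since the default `gamma` for `SVC` changed from `'auto'` to `'scale'` in version 0.22:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Load the labeled tumor samples that ship with scikit-learn
cancer = load_breast_cancer()

# random_state=0 is an assumption for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

# SVC with an RBF kernel and all default parameters, as in the last video
svc = SVC(kernel="rbf").fit(X_train, y_train)

print("Train accuracy:", svc.score(X_train, y_train))
print("Test accuracy:", svc.score(X_test, y_test))
```

On older scikit-learn versions (with `gamma='auto'`), this is where the 100% training accuracy and the much lower test accuracy show up.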
In this video we're gonna use matplotlib to visualize our data and to understand what it means for it to be unscaled: the difference in orders of magnitude between the values within each feature, and the difference in magnitude between features.
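One way to sketch this kind of visualization (the plot style and filename here are assumptions, not necessarily what the video shows) is to plot the minimum and maximum of each feature on a log scale, which makes the differences in orders of magnitude between features stand out:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption for headless use
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
X = cancer.data

# Plot per-feature minima and maxima; the log scale exposes the
# spread in magnitude across the 30 features
plt.plot(X.min(axis=0), "v", label="feature minimum")
plt.plot(X.max(axis=0), "^", label="feature maximum")
plt.yscale("log")
plt.xlabel("feature index")
plt.ylabel("feature magnitude (log scale)")
plt.legend()
plt.savefig("feature_magnitudes.png")  # hypothetical output filename
```

Features whose markers sit many decades apart on the y-axis are exactly the ones that dominate an unscaled RBF kernel.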
Then, in the next tutorial, we're gonna try to remedy this issue by scaling the data. Please watch the video below for the full scoop.
As a reminder:
In this series I'm going to explore the cancer dataset that comes pre-loaded with scikit-learn. The purpose is to train classifiers on this dataset, which consists of labeled data: 569 tumor samples, each labeled malignant or benign, and then use them on new, unlabeled data.
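For reference, loading the dataset and checking the counts described above takes just a few lines:

```python
from sklearn.datasets import load_breast_cancer

# The dataset ships with scikit-learn: 569 labeled tumor samples,
# each described by 30 numeric features
cancer = load_breast_cancer()
print(cancer.data.shape)          # (569, 30)
print(list(cancer.target_names))  # ['malignant', 'benign']
```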
Previous videos in this series:
- Machine Learning on a Cancer Dataset - Part 20
- Machine Learning on a Cancer Dataset - Part 21
- Machine Learning on a Cancer Dataset - Part 22
- Machine Learning on a Cancer Dataset - Part 23
- Machine Learning on a Cancer Dataset - Part 24
- Machine Learning on a Cancer Dataset - Part 25
- Machine Learning on a Cancer Dataset - Part 26
- Machine Learning on a Cancer Dataset - Part 27
To stay in touch with me, follow
Cristi Vlad, Self-Experimenter and Author