This is the fourth video tutorial on support vector machines (SVMs) with scikit-learn on the cancer dataset. In the last video, we trained a support vector classifier (SVC) with an RBF kernel and all default parameters on our dataset.
We noticed how it overfits the training data (scoring 100% on the training set) and how poorly it performs on the test subset. Several factors could explain the decreased performance of the algorithm, including the scale of the data and untuned hyperparameters.
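If you want to reproduce that setup, a minimal sketch looks like the one below. Note that `random_state=0` is an assumption (the video may use a different split), and the exact scores depend on your scikit-learn version, since the default `gamma` for `SVC` changed from `'auto'` to `'scale'` in version 0.22:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Load the labeled tumor samples that ship with scikit-learn
cancer = load_breast_cancer()

# random_state=0 is an assumption for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

# SVC with an RBF kernel and all default parameters, as in the last video
svc = SVC(kernel="rbf").fit(X_train, y_train)

print("Train accuracy:", svc.score(X_train, y_train))
print("Test accuracy:", svc.score(X_test, y_test))
```

On older scikit-learn versions (with `gamma='auto'`), this is where the 100% training accuracy and the much lower test accuracy show up.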
In this video we're gonna use matplotlib to visualize our data and to understand what it means for it to be unscaled: the difference in orders of magnitude between the values within each feature, and the difference in magnitude between features.
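One way to sketch this kind of visualization (the plot style and filename here are assumptions, not necessarily what the video shows) is to plot the minimum and maximum of each feature on a log scale, which makes the differences in orders of magnitude between features stand out:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption for headless use
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
X = cancer.data

# Plot per-feature minima and maxima; the log scale exposes the
# spread in magnitude across the 30 features
plt.plot(X.min(axis=0), "v", label="feature minimum")
plt.plot(X.max(axis=0), "^", label="feature maximum")
plt.yscale("log")
plt.xlabel("feature index")
plt.ylabel("feature magnitude (log scale)")
plt.legend()
plt.savefig("feature_magnitudes.png")  # hypothetical output filename
```

Features whose markers sit many decades apart on the y-axis are exactly the ones that dominate an unscaled RBF kernel.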
Then, in the next tutorial, we're gonna try to remedy this issue by scaling the data. Please watch the video below for the full scoop.
As a reminder:
In this series I'm going to explore the cancer dataset that comes pre-loaded with scikit-learn. The purpose is to train classifiers on this dataset, which consists of labeled data: 569 tumor samples, each labeled malignant or benign, and then use them on new, unlabeled data.
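For reference, loading the dataset and checking the counts described above takes just a few lines:

```python
from sklearn.datasets import load_breast_cancer

# The dataset ships with scikit-learn: 569 labeled tumor samples,
# each described by 30 numeric features
cancer = load_breast_cancer()
print(cancer.data.shape)          # (569, 30)
print(list(cancer.target_names))  # ['malignant', 'benign']
```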
Previous videos in this series:
- Machine Learning on a Cancer Dataset - Part 20
- Machine Learning on a Cancer Dataset - Part 21
- Machine Learning on a Cancer Dataset - Part 22
- Machine Learning on a Cancer Dataset - Part 23
- Machine Learning on a Cancer Dataset - Part 24
- Machine Learning on a Cancer Dataset - Part 25
- Machine Learning on a Cancer Dataset - Part 26
- Machine Learning on a Cancer Dataset - Part 27
To stay in touch with me, follow
Cristi Vlad, Self-Experimenter and Author