Right now, machine learning is a big thing, from Google's self-driving car to disease prediction! And in this field, data is everything. If you don't have good-quality data, your accuracy is going to be a mess. So it's customary to preprocess your data. But there are so many preprocessing techniques. Which ones should I use? There are already a lot of tutorials online, but I'd like to share what I've been doing in my current project.
Preprocessing Techniques
Original and downsized (top) vs. cropped and downsized (bottom)
It all depends on your data. The data that I'm currently working with are images from multiple databases. Imagine the heterogeneity of the data! So here are some preprocessing methods that I use:
- Normalization > I rescale the pixel values to a range of 0 to 1. I do this to keep the values from blowing up.
- Downsizing the image > Large images take too long to load, so downsizing them speeds up the computation.
- Cropping the image > Sometimes there are unnecessary portions in an image, so I just crop to the area that I want to focus on.
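The three steps above can be sketched in a few lines of NumPy. This is only a minimal illustration, not the script I used: the function names, the 480x640 random test image, the 400-pixel crop, and the downsizing factor of 4 are all assumptions made for the example, and the image is assumed to be an 8-bit array with a channel axis (height x width x channels).

```python
import numpy as np

def normalize(image):
    """Scale pixel values from the 8-bit range [0, 255] to floats in [0, 1]."""
    return image.astype(np.float32) / 255.0

def downsize(image, factor):
    """Shrink an H x W x C image by an integer factor by averaging factor x factor blocks."""
    h, w = image.shape[:2]
    h, w = h - h % factor, w - w % factor  # trim so the blocks divide evenly
    trimmed = image[:h, :w].astype(np.float32)
    blocks = trimmed.reshape(h // factor, factor, w // factor, factor, -1)
    return blocks.mean(axis=(1, 3))

def center_crop(image, crop_h, crop_w):
    """Keep a crop_h x crop_w window around the centre of the image."""
    h, w = image.shape[:2]
    top, left = (h - crop_h) // 2, (w - crop_w) // 2
    return image[top:top + crop_h, left:left + crop_w]

# Hypothetical 8-bit RGB image standing in for a real photo.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(480, 640, 3), dtype=np.uint8)

cropped = center_crop(image, 400, 400)  # focus on the middle of the frame
small = downsize(cropped, 4)            # 400 x 400 becomes 100 x 100
ready = normalize(small)                # values now lie in [0, 1]
```

Cropping first means the downsizing step has fewer pixels to average, but the order is a choice for this example; pick whatever suits your data.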
Data Augmentation
Another thing in machine learning: you need a lot of data! The bigger the better, because the data is what the model learns from. A technique that I use to increase the amount of data is data augmentation. For images, I do the following:
- Random adjustment of the brightness
- Random adjustment of the contrast
- Random rotation
- Random flipping
Here are sample augmentations featuring a sleeping Loki.
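For a rough idea of how the four augmentations listed above work, here is a small NumPy sketch. It is not the script behind the sample images: the function names, the value ranges (`max_delta`, `lower`, `upper`), and the 50% flip chance are assumptions, and rotation is simplified to multiples of 90 degrees so that no image library is needed. The input is assumed to be already normalized to the range 0 to 1.

```python
import numpy as np

rng = np.random.default_rng(42)

def random_brightness(image, max_delta=0.2):
    """Add one random offset to every pixel, then clip back to [0, 1]."""
    delta = rng.uniform(-max_delta, max_delta)
    return np.clip(image + delta, 0.0, 1.0)

def random_contrast(image, lower=0.7, upper=1.3):
    """Stretch or squeeze pixel values around the image mean."""
    factor = rng.uniform(lower, upper)
    mean = image.mean()
    return np.clip((image - mean) * factor + mean, 0.0, 1.0)

def random_rotation(image):
    """Rotate by a random multiple of 90 degrees (a simplification)."""
    k = int(rng.integers(0, 4))
    return np.rot90(image, k, axes=(0, 1))

def random_flip(image):
    """Mirror the image left-to-right about half of the time."""
    return image[:, ::-1] if rng.random() < 0.5 else image

def augment(image):
    """Apply all four random augmentations in sequence."""
    for step in (random_brightness, random_contrast, random_rotation, random_flip):
        image = step(image)
    return image

# Hypothetical normalized square image; every call yields a different variant.
image = rng.random((64, 64, 3))
augmented = [augment(image) for _ in range(5)]
```

Libraries such as the ones listed in the next section support rotation by arbitrary angles, which is what you would normally want for real images.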
Python Libraries
I wrote the scripts that produced the images above myself, but here are some libraries that you can use:
- OpenCV
- Pillow
- Scikit-image
This pipeline produces a total of 1000 images which are cropped at the center,
rotated and zoomed according to a certain probability.
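A pipeline like the one described above could be sketched as follows. This is a NumPy stand-in rather than the actual pipeline code: the helper names, the 128x128 random source images, the 96-pixel crop, the zoom factor, and the probabilities `p_rotate` and `p_zoom` are all assumptions for illustration, and rotation is again limited to multiples of 90 degrees.

```python
import numpy as np

rng = np.random.default_rng(0)

def center_crop(image, size):
    """Keep a size x size window around the centre of the image."""
    h, w = image.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return image[top:top + size, left:left + size]

def rotate(image):
    """Rotate by 90, 180 or 270 degrees (stand-in for arbitrary angles)."""
    return np.rot90(image, int(rng.integers(1, 4)), axes=(0, 1))

def zoom(image, factor=1.25):
    """Zoom in on a square image: crop the centre, then resize back
    to the original size with nearest-neighbour sampling."""
    size = image.shape[0]
    inner = center_crop(image, int(size / factor))
    idx = np.arange(size) * inner.shape[0] // size
    return inner[idx][:, idx]

def run_pipeline(sources, n_samples, crop_size=96, p_rotate=0.7, p_zoom=0.5):
    """Draw n_samples augmented images from the source images."""
    samples = []
    for _ in range(n_samples):
        image = sources[rng.integers(len(sources))]
        image = center_crop(image, crop_size)  # always applied
        if rng.random() < p_rotate:            # applied with probability p_rotate
            image = rotate(image)
        if rng.random() < p_zoom:              # applied with probability p_zoom
            image = zoom(image)
        samples.append(image)
    return samples

# Hypothetical source images standing in for a real dataset.
sources = [rng.random((128, 128, 3)) for _ in range(10)]
samples = run_pipeline(sources, n_samples=1000)
```

Each operation fires independently with its own probability, so 1000 samples drawn from a handful of source images still come out as a varied set.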
I know preprocessing is tedious, but using such libraries helps us speed up the process. Once the data is all set, I can start feeding it to my machine learning algorithm (which can be a topic for another blog post)! I hope this post helps you with your future projects. :-)