In 2018, I picked up a book called Machine Learning, written, ironically, by a guy named Mark Graph. It took me about two and a half months to finish, and I would be lying if I told you I understood it. What I did understand was the small ember of curiosity it sparked in my data journey. I took my first statistics course in graduate school; the highest math class I had taken before that was at SCAD, simply called “Math”. At first I struggled to understand the gritty math of regressions, but the concepts I learned reignited that curiosity about machine learning. When 2020 (yes, the whole year) happened, I decided to jump into some more practical applications of machine learning.
As an overconfident script kiddie, I wanted to keep exploring the predictive power of machine learning with the new statistics and econometrics skills I was acquiring. I took two Udemy courses, Machine Learning A-Z and Deep Learning, which were a great introduction to how neural networks and deep learning work with datasets. They also paired well with the SQL class I would be taking that fall semester.
Flash forward to my final semester of graduate school: I took a course called Advanced Modeling and Analytics, which focused on deep learning and neural networks. The website Kaggle is a great source of datasets for testing your data science skills. I started diving into some of them, and after hours of reading code and googling what each command did, I decided to put together a reusable template of commands for computer vision: classifying sets of like pictures.
In the spirit of full transparency, I do not want you to think the following code was written in one shot. Some of it is copied and pasted from documentation or other users.
It was compiled over hours of googling, Kaggle searches, and reloading the same documentation pages repeatedly…
The dataset I chose to focus on was the American Sign Language (ASL) dataset. It consists of pictures of hands signing the different letters of the ASL alphabet, plus a few images with nothing in the frame.
I noticed that most notebooks on Kaggle set goals for the code, so I followed suit and set a goal of >90% model accuracy.
To start, we need to gather all the libraries we will need:
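The original import cell isn't shown here, but based on the steps that follow, a plausible import list for the TensorFlow 2-era Keras stack might look something like this (treat it as a sketch, not the exact original):

```python
import os
from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Deep learning stack: Keras as bundled with TensorFlow 2.x
import tensorflow as tf
from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.applications.mobilenet_v2 import preprocess_input
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras import layers, models, optimizers, callbacks

# Evaluation utilities
from sklearn.metrics import classification_report
```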
Depending on how the data is structured on Kaggle (or any database), we need to separate the training and test data in order to build the neural network correctly. Luckily, this dataset already has separate training and test folders, so we just need a function that walks the data and builds a pandas DataFrame of file paths and labels.
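A minimal sketch of such a helper, assuming a directory layout of `root/<label>/<image files>` (the name `build_df` and the shuffle seed are my choices, not the original code):

```python
import random
from pathlib import Path

import pandas as pd

def build_df(root_dir, seed=42):
    """Walk root_dir/<label>/* and return a shuffled DataFrame
    with one row per image: its file path and its class label."""
    rows = []
    for label_dir in sorted(Path(root_dir).iterdir()):
        if not label_dir.is_dir():
            continue
        for img_path in label_dir.iterdir():
            rows.append({"filepath": str(img_path), "label": label_dir.name})
    random.Random(seed).shuffle(rows)  # randomize the row order up front
    return pd.DataFrame(rows)
```

Shuffling here is what lets a later preview of the first few rows show a mix of classes instead of one long run of the same letter.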
Let’s take a look at an overview of the training data set:
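With a DataFrame of paths and labels in hand, a quick text overview is just a couple of pandas calls (shown on a toy frame here, since the real one isn't reproduced):

```python
import pandas as pd

# Toy stand-in for the real training DataFrame of paths and labels
train_df = pd.DataFrame({
    "filepath": ["train/A/0.jpg", "train/A/1.jpg", "train/B/0.jpg"],
    "label": ["A", "A", "B"],
})

print(train_df.head())                   # first few rows: filepath + label
print(train_df["label"].value_counts())  # images per class
print(train_df["label"].nunique(), "classes")
```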
Text overviews are great for some datasets, but since we are working with images, it is better to visualize the dataset and see what we are working with. Fortunately, Kaggle is a great community of data junkies who are a lot smarter than I am, and I found code that displays a set number of images as thumbnails. Within our original DataFrame-building function, we shuffled the dataset up front to avoid any ordering complications.
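The display code isn't reproduced here, but the idea is a matplotlib grid of thumbnails. A self-contained sketch, using random arrays in place of loaded images (the helper name and grid size are mine):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt
import numpy as np

def show_thumbnails(images, labels, n=9, cols=3):
    """Display the first n images in a thumbnail grid with their labels."""
    rows = int(np.ceil(n / cols))
    fig, axes = plt.subplots(rows, cols, figsize=(cols * 2, rows * 2))
    for ax, img, lbl in zip(axes.ravel(), images[:n], labels[:n]):
        ax.imshow(img)
        ax.set_title(lbl, fontsize=8)
        ax.axis("off")
    fig.tight_layout()
    return fig

# Stand-in "images": random 64x64 RGB arrays
imgs = [np.random.rand(64, 64, 3) for _ in range(9)]
fig = show_thumbnails(imgs, [f"class_{i}" for i in range(9)])
```

In the real notebook, the images come from reading the file paths in the DataFrame rather than random arrays.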
Whenever I find code I want to implement in my own projects, I go through it, comment every line, and run it step by step. That helps me wrap my head around how the function is structured in case I need to adjust it for the dataset in use. As an art school kid, the visual representation helped me digest the scope of the dataset and how we needed to analyze the images.
Now we get to the fun part of machine learning modeling: writing a bunch more code, then tweaking the code, then hoping it works! /s
We first need to preprocess the images for the model we plan on using (MobileNetV2). It is important to choose the preprocessing that matches whichever model you plan to use for your neural network; the model will give you error messages when fitting if the preprocessing is not aligned.
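For MobileNetV2, Keras ships a matching `preprocess_input` that rescales pixels from [0, 255] down to [-1, 1]; feeding a model inputs scaled for a different architecture is exactly the kind of mismatch that produces those fitting errors. A quick sketch on a fake image:

```python
import numpy as np
from tensorflow.keras.applications.mobilenet_v2 import preprocess_input

# A fake 224x224 RGB image with pixel values in [0, 255]
img = np.random.randint(0, 256, size=(224, 224, 3)).astype("float32")

# MobileNetV2 expects inputs scaled to [-1, 1]
x = preprocess_input(img.copy())
print(x.min(), x.max())  # both values now fall within [-1, 1]
```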
Next is building the training, validation, and test splits of the images. This part of the overall “template” needs to be adjusted depending on the number of classes you expect from the dataset and on hardware restrictions.
For example, “class_mode=categorical” is used when a dataset has multiple categories, while “class_mode=binary” means the output is one of two choices.
The batch size can also be adjusted to save time or improve the accuracy of the build. Typically, the smaller the batch size, the longer training takes, though smaller batches can improve the accuracy of the final model. The batch size also affects how much RAM or GPU memory is used while training the model.
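Putting those pieces together, here is a sketch of the split-building step using the TensorFlow 2-era `ImageDataGenerator.flow_from_dataframe` API (the function name, the 80/20 validation split, and the seed are my assumptions; newer Keras releases have since deprecated this API in favor of `image_dataset_from_directory`):

```python
from tensorflow.keras.applications.mobilenet_v2 import preprocess_input
from tensorflow.keras.preprocessing.image import ImageDataGenerator

def make_generators(train_df, test_df, img_size=(224, 224), batch_size=32):
    """Build train/validation/test iterators from path+label DataFrames."""
    train_gen = ImageDataGenerator(
        preprocessing_function=preprocess_input,  # match MobileNetV2 scaling
        validation_split=0.2,  # hold out 20% of training data for validation
    )
    test_gen = ImageDataGenerator(preprocessing_function=preprocess_input)

    common = dict(
        x_col="filepath", y_col="label",
        target_size=img_size, class_mode="categorical",
        batch_size=batch_size, seed=42,
    )
    train = train_gen.flow_from_dataframe(train_df, subset="training", **common)
    val = train_gen.flow_from_dataframe(train_df, subset="validation", **common)
    test = test_gen.flow_from_dataframe(test_df, shuffle=False, **common)
    return train, val, test
```

Keeping `shuffle=False` on the test iterator matters later, so predictions line up row-for-row with the test DataFrame.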
The output gives us the number of images in each split as well as the number of classes. If class_mode were set to “binary”, the class count would read 2 instead of the dataset’s full class count.
Now that the splits are in place, we need to build the actual feature-extraction model to fit the data to. After some research and testing, MobileNetV2 gave the best results. Now we can fit the model and train it.
I stored the Adam optimizer in a variable so I could change the algorithm’s learning rate if I needed to. I think this is good practice, if only to understand how the process and functions correlate to the outputs.
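A typical transfer-learning setup over MobileNetV2 looks roughly like this (a sketch: the pooling/head layers, the learning rate, and the frozen base are my choices; the class count is passed in rather than hard-coded since the exact number isn't shown in the text):

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.optimizers import Adam

def build_model(num_classes, weights="imagenet"):
    """MobileNetV2 feature extractor with a small classification head."""
    base = MobileNetV2(input_shape=(224, 224, 3), include_top=False,
                       weights=weights)
    base.trainable = False  # keep the pretrained features frozen

    model = models.Sequential([
        base,
        layers.GlobalAveragePooling2D(),  # collapse feature maps to a vector
        layers.Dense(num_classes, activation="softmax"),
    ])

    # Keeping the optimizer in a variable makes the learning rate easy to tweak
    adam = Adam(learning_rate=1e-3)
    model.compile(optimizer=adam, loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

In practice you would call `build_model` with the class count reported by the generators.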
The history variable holds the data we will use to analyze and visualize the model. Within the fit call, the epochs parameter defines the number of “rounds” the model will train through.
Patience refers to the number of epochs that must pass without improved accuracy before the model stops early rather than proceeding to the next epoch.
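In Keras, epochs and patience come together through the EarlyStopping callback. A self-contained sketch, exercising the fit loop on a tiny stand-in model and random data (the monitored metric and patience value here are my placeholders):

```python
import numpy as np
from tensorflow.keras import layers, models
from tensorflow.keras.callbacks import EarlyStopping

# Tiny stand-in model and random data, just to exercise the fit loop
model = models.Sequential([
    layers.Input(shape=(8,)),
    layers.Dense(4, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])

x = np.random.rand(64, 8)
y = np.eye(4)[np.random.randint(0, 4, 64)]

early_stop = EarlyStopping(
    monitor="val_loss",          # quantity tracked for improvement
    patience=3,                  # epochs allowed without improvement
    restore_best_weights=True,   # roll back to the best epoch seen
)

history = model.fit(x, y, validation_split=0.25, epochs=10,
                    callbacks=[early_stop], verbose=0)
print(len(history.history["loss"]), "epochs actually ran")
```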
Now we can start seeing how the model performed. The final test loss was 0.17 with an accuracy of 95.06%. The lower the loss the better, but be careful not to overfit the model.
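The history object stores per-epoch metrics, so the training curves are a short matplotlib plot away (sketched here with a clearly made-up stand-in for `history.history`):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt

# Stand-in for history.history from model.fit (illustrative numbers only)
hist = {
    "accuracy":     [0.55, 0.74, 0.85, 0.91, 0.94],
    "val_accuracy": [0.52, 0.70, 0.82, 0.88, 0.90],
    "loss":         [1.90, 1.10, 0.60, 0.35, 0.22],
    "val_loss":     [2.00, 1.20, 0.70, 0.45, 0.30],
}

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
for key in ("accuracy", "val_accuracy"):
    ax1.plot(hist[key], label=key)
for key in ("loss", "val_loss"):
    ax2.plot(hist[key], label=key)
ax1.set_title("Accuracy per epoch"); ax1.legend()
ax2.set_title("Loss per epoch"); ax2.legend()
```

A widening gap between the training and validation curves is the classic visual sign of the overfitting mentioned above.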
Next, we can run a predict function on the test data and see how the model predicts the first 5 images.
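Mapping the raw softmax outputs back to letters takes an argmax plus an inversion of the generator's `class_indices` mapping. Sketched with a dummy prediction matrix (the four classes here are illustrative, not the full ASL set):

```python
import numpy as np

# Dummy softmax output for 5 test images over 4 classes
preds = np.array([
    [0.10, 0.70, 0.10, 0.10],
    [0.80, 0.10, 0.05, 0.05],
    [0.05, 0.05, 0.10, 0.80],
    [0.20, 0.20, 0.50, 0.10],
    [0.90, 0.03, 0.03, 0.04],
])

# class_indices comes from the generator, e.g. {'A': 0, 'B': 1, ...}
class_indices = {"A": 0, "B": 1, "C": 2, "del": 3}
idx_to_label = {v: k for k, v in class_indices.items()}

# argmax picks the highest-probability class for each image
pred_labels = [idx_to_label[i] for i in preds.argmax(axis=1)]
print(pred_labels)  # -> ['B', 'A', 'del', 'C', 'A']
```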
In order to see how the model performs across the individual classes, we can run sklearn’s classification_report.
I was confused by this output at first, so hopefully this TL;DR helps with understanding classification reports:
Precision = Accuracy of positive predictions: TP / (TP + FP)
Recall = Fraction of actual positives correctly identified: TP / (TP + FN)
F1-Score = Harmonic mean of precision and recall (best = 1, worst = 0)
Support = Number of occurrences of the class in the dataset
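These four metrics come straight out of scikit-learn's classification_report. A toy example where the definitions can be checked by hand:

```python
from sklearn.metrics import classification_report

y_true = ["A", "A", "B", "B", "B", "C"]
y_pred = ["A", "B", "B", "B", "B", "C"]

# For class B: predicted 4 times, 3 correct -> precision = 0.75
#              3 actual B's, all 3 found    -> recall = 1.0, support = 3
print(classification_report(y_true, y_pred))
```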
Finally, we can take the same thumbnail code from above and display the test images. One great addition (not mine) is labels showing each image’s predicted class alongside its true label, telling us whether the prediction was correct.
Recall that earlier the model predicted the test set would start with [‘P’, ‘A’, ‘W’, ‘del’, ‘E’].
Overall, we achieved the goal: a convolutional neural network model with an accuracy above 90%.
There may still be room to improve the model, but I think it is a great start toward where I want to take the project next.
If you want to see the code in action: https://www.kaggle.com/chrisflowers/asl-classification-95
The next step is incorporating this model into a live computer vision pipeline that connects to a webcam, identifies which letter is being signed, and records it to a database.
I am excited to implement this network on live video to test it. I wonder if I will need a higher accuracy in order to classify live video. Stay tuned for the next segment of the journey.
Thanks for reading! Hope it helps or sparks some inspiration!