Last month, at Tech Talk Tuesday, we formed a team for the Kaggle competition Getting Started with Julia. Last week, our team, Data Science Saigon, took the number one spot on the leaderboard. Here's how it happened.
We met at Atlassian, a company with offices here in Saigon. We reviewed DataFrames with Julia and, with help from the Kaggle forums, got the code from the Julia Tutorial and the K-Nearest-Neighbor tutorial working. In particular, some method calls have changed since the tutorials were created, but we found the workaround in the forum topic "Convert method error when reading image files". Our versions work as of Julia 0.3.7. Implementing the tutorial code got us to about 46th place at the time.
However, boosting the parameters of our Random Forest algorithm didn't drastically improve our score. That is when we found out about the Mocha package.
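For readers who have not seen it, the idea behind the K-Nearest-Neighbor baseline can be sketched in a few lines. This is not the tutorial's Julia code. It is a minimal Python illustration on made-up four-pixel "images", and the function name `knn_predict` and the toy data are assumptions chosen only for the example.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points closest to query."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k nearest neighbours and let them vote
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy "images", each flattened to 4 pixel intensities (made-up data)
train_points = [
    (0.0, 0.1, 0.0, 0.2),
    (0.1, 0.0, 0.1, 0.1),
    (0.9, 1.0, 0.8, 0.9),
    (1.0, 0.9, 1.0, 0.8),
]
train_labels = ["A", "A", "B", "B"]

prediction = knn_predict(train_points, train_labels, (0.95, 0.9, 0.9, 0.85), k=3)
print(prediction)  # prints B
```

On real character images every pixel becomes one coordinate, so the distance computation dominates the running time and the choice of k matters much more than it does here.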
Convolutional Neural Networks
The recent resurgence in popularity of Neural Networks is due to the amazing performance of Convolutional Neural Networks (CNNs) at image classification. This was exactly the solution our problem needed. Dung Thai is very knowledgeable about Deep Learning, and encouraged us to try out the Mocha Package for Deep Learning in Julia. As a result we quickly moved into the top 20 on the leaderboard.
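To give a feel for what a convolutional layer computes, here is a rough sketch in plain Python. It is not Mocha or Theano code, and it is not our network. The helper names `conv2d_valid` and `relu`, the 4x4 image, and the hand-picked kernel are all assumptions for illustration; in a real CNN the kernel weights are learned from the training data.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1).

    Like most CNN libraries, this computes a cross-correlation:
    the kernel is not flipped before it is applied.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Weighted sum of the image patch under the kernel
            total = sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh)
                for dj in range(kw)
            )
            row.append(total)
        feature_map.append(row)
    return feature_map

def relu(feature_map):
    """Zero out negative responses, keeping only positive activations."""
    return [[max(0, value) for value in row] for row in feature_map]

# 4x4 toy image with a vertical edge between columns 1 and 2
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hand-picked kernel that responds to dark-to-bright vertical edges
kernel = [
    [-1, 1],
    [-1, 1],
]

features = relu(conv2d_valid(image, kernel))
print(features)  # prints [[0, 2, 0], [0, 2, 0], [0, 2, 0]]
```

The feature map lights up only in the column where the edge sits. Because the same small kernel is reused at every position, the layer detects that edge wherever it appears in the image, which is a large part of why CNNs do so well on character recognition.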
Pulling out all the Stops
At our next meeting Dung (pronounced Yung) summarized Learn from the Best, and we talked about how to get to the next level. Data Science Saigon has talent across a variety of platforms and languages including C++, Caffe, Scilab, Python and of course Julia. We also noticed a few things about the rules for this particular competition:
- Outside Data is not forbidden
- Semi-supervised learning is not forbidden
- The language does not have to be Julia
We pondered how a Convolutional Network from a good Python library like Theano would perform. We also obtained many more training images from the Chars74k dataset and the Street View House Numbers dataset.
Saigon là số một (Saigon Is Number One)
Then last week Dung Thai, Vinh Vu, and Nguyen Quy checked in Python code using Theano that recognizes over 92% of the images correctly, vaulting us into the #1 spot on the Getting Started with Julia leaderboard. Congratulations to everyone taking part in Data Science Saigon.
Our Remaining Challenge
So clearly, training with lots more data improved the score. But the question remains: would a CNN in Julia, trained with the same additional data, produce a score similar to that of the Python code? We hope to find out when we meet again. All of our code is here.