
Thursday, July 23, 2015

How Data Science Saigon took the lead in a Kaggle Competition

Last month, at Tech Talk Tuesday, we formed a team for the Kaggle competition Getting Started with Julia. Last week, our team, Data Science Saigon, took the number one spot on the leaderboard. Here's how it happened.

Entering a Kaggle competition

You've got to be in it to win it. When our team met on June 16th, we created accounts on Kaggle's site and on Bitbucket by Atlassian, a company with offices here in Saigon. We reviewed DataFrames with Julia, and got the code from the Julia Tutorial and the K-Nearest-Neighbor tutorial working with help from the Kaggle forums. Some method calls have changed since the tutorials were written, but we found a workaround in the forum topic "Convert method error when reading image files". Our versions work as of Julia 0.3.7. Implementing the tutorial code got us to about 46th place at the time.
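For readers new to K-Nearest-Neighbor, here is a minimal sketch of the idea: a new image gets the label that is most common among the k training images closest to it. This is not the tutorial's Julia code. It is an illustrative Python version, and the function names and the tiny 4-pixel "images" are made up for the example.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length pixel vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_x, train_y, query, k=3):
    """Return the majority label among the k training vectors nearest to query."""
    # Rank training indices by distance to the query and keep the k closest.
    nearest = sorted(range(len(train_x)),
                     key=lambda i: euclidean(train_x[i], query))[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: flattened 4-pixel "images" in two well-separated groups.
train_x = [[0, 0, 0, 1], [0, 0, 1, 1], [0, 1, 0, 0],
           [9, 9, 8, 9], [8, 9, 9, 9], [9, 8, 9, 8]]
train_y = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(train_x, train_y, [1, 0, 0, 0]))  # A
print(knn_predict(train_x, train_y, [8, 8, 8, 8]))  # B
```

On real character images the vectors would be the flattened pixel intensities of each picture, and k would be chosen by cross-validation rather than fixed at 3.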

Machine Learning with Julia

At our next meeting we took a look at how to build a predictive model based on the Random Forest algorithm.

But tuning the parameters of our Random Forest model didn't drastically improve our score. That is when we found out about the Mocha package.
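To show what those parameters control, here is a rough Python sketch of the Random Forest idea: each tree is trained on a bootstrap sample of the data and only sees a random subset of the features, and the forest predicts by majority vote. To keep it short, every "tree" here is a single split (a decision stump), which is a simplification of the real algorithm. The names `fit_forest`, `n_trees` and `n_features` are hypothetical and are not taken from our Julia code.

```python
import random
from collections import Counter


def majority(labels):
    """Most common label in a list."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(xs, ys, features):
    """Find the (feature, threshold) split with the fewest training errors."""
    best = None
    for f in features:
        for t in sorted({x[f] for x in xs}):
            left = [y for x, y in zip(xs, ys) if x[f] <= t]
            right = [y for x, y in zip(xs, ys) if x[f] > t]
            if not left or not right:
                continue  # not a real split
            l_lab, r_lab = majority(left), majority(right)
            errors = (sum(y != l_lab for y in left)
                      + sum(y != r_lab for y in right))
            if best is None or errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    if best is None:
        # No usable split in this sample: predict a constant label.
        lab = majority(ys)
        return (0, float("inf"), lab, lab)
    return best[1:]


def fit_forest(xs, ys, n_trees=25, n_features=1, seed=0):
    """Train n_trees stumps, each on a bootstrap sample and a random feature subset."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(xs)) for _ in xs]  # sample with replacement
        feats = rng.sample(range(len(xs[0])), n_features)
        forest.append(fit_stump([xs[i] for i in idx],
                                [ys[i] for i in idx], feats))
    return forest


def predict(forest, x):
    """Majority vote over all stumps."""
    return majority([l if x[f] <= t else r for f, t, l, r in forest])


# Hypothetical toy data with two features and two classes.
xs = [[1, 1], [2, 1], [1, 2], [2, 2], [8, 9], [9, 8], [8, 8], [9, 9]]
ys = ["A"] * 4 + ["B"] * 4
forest = fit_forest(xs, ys)
print(predict(forest, [1, 1]), predict(forest, [9, 9]))
```

Raising `n_trees` or `n_features` makes the vote more stable, but it cannot add information that the features do not carry, which is consistent with the modest gains we saw.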

Convolutional Neural Networks 

The recent resurgence in popularity of Neural Networks is due to the amazing performance of Convolutional Neural Networks (CNNs) at image classification. This was exactly the solution our problem needed. Dung Thai is very knowledgeable about Deep Learning, and encouraged us to try out the Mocha package for Deep Learning in Julia. As a result we quickly moved into the top 20 on the leaderboard.
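The core building block of a CNN is easy to sketch without any library: slide a small filter over the image, keep the positive responses (ReLU), then shrink the result by taking the maximum over small windows (max pooling). The following is a plain-Python illustration of that one step, not Mocha or Theano code, and the image and filter values are invented for the example. In a real network the filter weights are learned during training.

```python
def conv2d(image, kernel):
    """Valid (no padding) 2-D cross-correlation, which CNN layers call convolution."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [sum(image[r + i][c + j] * kernel[i][j]
             for i in range(kh) for j in range(kw))
         for c in range(out_w)]
        for r in range(out_h)
    ]


def relu(fmap):
    """Replace negative responses with zero."""
    return [[max(0, v) for v in row] for row in fmap]


def max_pool(fmap, size=2):
    """Maximum over non-overlapping size x size windows."""
    return [
        [max(fmap[r + i][c + j] for i in range(size) for j in range(size))
         for c in range(0, len(fmap[0]) - size + 1, size)]
        for r in range(0, len(fmap) - size + 1, size)
    ]


# Hypothetical 5x5 image: dark on the left, bright on the right.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hand-written filter that responds to a dark-to-bright vertical edge.
kernel = [[-1, 1],
          [-1, 1]]

features = max_pool(relu(conv2d(image, kernel)))
print(features)  # [[2, 0], [2, 0]]
```

The pooled output is non-zero only where the edge is, which is the sense in which a convolutional layer detects local patterns regardless of where they sit in the image.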

Pulling out all the Stops

At our next meeting Dung (pronounced Yung) summarized Learn from the Best, and we talked about how to get to the next level. Data Science Saigon has talent across a variety of platforms and languages including C++, Caffe, Scilab, Python and of course Julia. We also noticed a few things about the rules for this particular competition:

  1. Outside Data is not forbidden
  2. Semi-supervised learning is not forbidden
  3. The language does not have to be Julia

We wondered how a Convolutional Network from a good Python library like Theano would perform. We also gathered many more training images from the Chars74k dataset and the Street View House Numbers dataset.

Saigon là số một (Saigon is number one).

Then last week Dung Thai, Vinh Vu, and Nguyen Quy checked in Python code using Theano that recognizes over 92% of the images correctly, and vaulted us into the #1 spot on the Getting Started with Julia leaderboard. Congratulations to everyone taking part in Data Science Saigon.

Our Remaining Challenge

So clearly, training with lots more data improved the score. But the question remains: would a CNN in Julia, trained on the same additional data, score as well as the Python code? We hope to find out when we meet again. All of our code is here.
