Thursday, July 23, 2015

How Data Science Saigon took the lead in a Kaggle Competition


Last month, at Tech Talk Tuesday, we formed a team for the Kaggle competition Getting Started with Julia. Last week, our team, Data Science Saigon, took the number one spot on the leaderboard. Here's how it happened.


Entering a Kaggle competition

You've got to be in it to win it. When our team met on June 16th, we created accounts on Kaggle's site and on Bitbucket by Atlassian, a company with offices here in Saigon. We reviewed DataFrames with Julia, and got the code from the Julia Tutorial and the K-Nearest-Neighbor tutorial working with help from the Kaggle forums. In particular, some method calls have changed since the tutorials were created, but we found the workaround in the "Convert method error when reading image files" topic. Our versions work as of Julia 0.3.7. Implementing the tutorial code got us to about 46th place at the time.
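
To give a flavor of the change, here is a minimal sketch of the image-loading step in the spirit of the tutorial code. The function name and the convert() call are our reconstruction of the forum workaround, not the exact code; the Images.jl calls varied between releases.

    using Images      # Julia 0.3-era imread API
    using DataFrames  # readtable() for trainLabels.csv

    # Read one 20x20 resized competition image and flatten it into a
    # 1x400 Float32 row. convert() replaces the tutorial's float32sc(),
    # which errors on newer Images.jl versions (the forum workaround).
    function read_image(path, image_size)
        img  = imread(path)
        temp = convert(Array{Float32}, img)
        if ndims(temp) == 3           # color image: average the channels
            temp = mean(temp, 3)
        end
        reshape(temp, 1, image_size)  # image_size = 400 for 20x20 images
    end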

Machine Learning with Julia

At our next meeting we took a look at how to build a predictive model based on the Random Forest algorithm.
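
In Julia that comes down to a couple of calls to the DecisionTree.jl package. A minimal sketch, with illustrative hyperparameters rather than the values we actually settled on:

    using DecisionTree

    # x_train: N x 400 matrix of flattened image pixels
    # y_train: N-element vector of character labels
    # build_forest(labels, features, n_subfeatures, n_trees)
    model = build_forest(y_train, x_train, 20, 50)

    # Predict a label for each flattened test image
    predictions = apply_forest(model, x_test)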


But cranking up the parameters of our Random Forest (more trees, more features per split) didn't drastically improve our score. That's when we found out about the Mocha package.

Convolutional Neural Networks 



The recent resurgence in popularity of Neural Networks is due to the amazing performance of Convolutional Neural Networks (CNNs) at image classification. This was exactly the solution our problem needed. Dung Thai is very knowledgeable about Deep Learning, and encouraged us to try out the Mocha package for Deep Learning in Julia. As a result, we quickly moved into the top 20 on the leaderboard.
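
Mocha lets you describe a network layer by layer, much as Caffe does. Here is a minimal sketch modeled on Mocha's MNIST tutorial and adapted to the competition's 62 classes (digits plus upper- and lower-case letters); the layer sizes and the HDF5 file list are illustrative, not our exact configuration:

    using Mocha

    backend = CPUBackend()
    init(backend)

    # train.txt lists HDF5 files holding the image tensors and labels
    data  = HDF5DataLayer(name="data", source="train.txt",
                          batch_size=64, tops=[:data, :label])
    conv1 = ConvolutionLayer(name="conv1", n_filter=20, kernel=(5,5),
                             bottoms=[:data], tops=[:conv1])
    pool1 = PoolingLayer(name="pool1", kernel=(2,2), stride=(2,2),
                         bottoms=[:conv1], tops=[:pool1])
    fc1   = InnerProductLayer(name="fc1", output_dim=500,
                              neuron=Neurons.ReLU(),
                              bottoms=[:pool1], tops=[:fc1])
    fc2   = InnerProductLayer(name="fc2", output_dim=62,  # 62 classes
                              bottoms=[:fc1], tops=[:fc2])
    loss  = SoftmaxLossLayer(name="loss", bottoms=[:fc2, :label])

    net = Net("char-cnn", backend, [data, conv1, pool1, fc1, fc2, loss])

    # Stochastic gradient descent, as in the Mocha MNIST example
    params = SolverParameters(max_iter=10000, regu_coef=0.0005,
                              mom_policy=MomPolicy.Fixed(0.9),
                              lr_policy=LRPolicy.Inv(0.01, 0.0001, 0.75))
    solver = SGD(params)
    solve(solver, net)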




Pulling out all the Stops

At our next meeting Dung (pronounced Yung) summarized Learn from the Best, and we talked about how to get to the next level. Data Science Saigon has talent across a variety of platforms and languages, including C++, Caffe, Scilab, Python, and of course Julia. We also noticed a few things about the rules for this particular competition:

  1. Outside data is not forbidden
  2. Semi-supervised learning is not forbidden
  3. The language does not have to be Julia

We pondered how a Convolutional Network from a good Python library like Theano would perform. We also accessed lots more training images from the Chars74k dataset and the Street View House Numbers dataset.
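
Folding the extra images in mostly means mapping each dataset's layout onto the competition's 62 labels and resizing everything to 20x20. A sketch for Chars74k, whose SampleNNN directory numbers encode the class; the directory convention and the imresize() call are our assumptions about the distribution format:

    using Images

    # Chars74k sample indices 1-62 map to digits, then upper-case,
    # then lower-case letters (assumed layout)
    const CHARS = vcat('0':'9', 'A':'Z', 'a':'z')

    chars74k_label(sample_index) = CHARS[sample_index]

    # Resize an extra image to 20x20 so it matches the competition data
    function extra_image(path)
        img = imresize(imread(path), (20, 20))
        reshape(convert(Array{Float32}, img), 1, 400)
    end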

Saigon là số một (Saigon is number one).

Then last week Dung Thai, Vinh Vu, and Nguyen Quy checked in Python code using Theano that recognizes over 92% of the images correctly, vaulting us into the #1 spot on the Getting Started with Julia leaderboard. Congratulations to everyone taking part in Data Science Saigon.

Our Remaining Challenge

So clearly, training with lots more data improved the score. But the question remains: would a CNN in Julia, given the same additional training data, score as well as the Python code? We hope to find out when we meet again. All of our code is here.






Monday, July 20, 2015

A Brief History of Neural Networks



In computer science, and specifically machine learning, programmers have been trying to simulate the behavior of the brain since the late 1940s. The brain's fundamental pattern is modeled as loosely connected nodes capable of learning and modifying their behavior as information is processed.


In 1948 Alan Turing's paper Intelligent Machinery called these loosely connected nodes unorganized machines, and compared them to an infant's brain.
[Image: A neural network used in a BioWall]



In the 1950s Frank Rosenblatt developed the Perceptron, a binary classification algorithm and one of the first implementations of a Neural Network. Programmers soon realized that neural networks were only really effective with two or more layers, but the processing power of the time made anything useful impractical to implement.




By the 1970s machines had improved, and Neural Networks again gathered interest, but were soon surpassed in utility by simpler classification algorithms such as Support Vector Machines and linear classifiers.
[Image: Samples from two classes; samples on the margin between them are called support vectors]


This century, Neural Networks have made strides again with the invention of Deep Learning. Geoffrey E. Hinton of the University of Toronto improved classification results by training each layer of a neural network separately. As a result, many classification competitions are now won using Deep Neural Networks, often running on GPU processors.



Scilab, Python's Theano, Julia's Mocha, and Caffe are all focused on deep learning and neural networks. Watch these projects evolve as deep learning gathers momentum.



