
Thursday, July 23, 2015

How Data Science Saigon took the lead in a Kaggle Competition


Last month, at Tech Talk Tuesday, we formed a team for the Kaggle competition Getting Started with Julia. Last week, our team, Data Science Saigon, took the number one spot on the leaderboard. Here's how it happened.


Entering a Kaggle competition

You've got to be in it to win it. When our team met on June 16th, we created accounts on Kaggle's site and on Bitbucket by Atlassian, a company with offices here in Saigon. We reviewed DataFrames with Julia, and got the code from the Julia Tutorial and the K-Nearest-Neighbor tutorial working with help from the Kaggle forums. In particular, some method calls have changed since the tutorials were created, but we found the workaround in the "Convert method error when reading image files" topic. Our versions work as of Julia 0.3.7. Implementing the tutorial code got us to about 46th place at the time.

Machine Learning with Julia

At our next meeting we took a look at how to build a predictive model based on the Random Forest algorithm.
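The post doesn't include our code, so here's a minimal illustrative sketch of the idea in Python with scikit-learn (one of the team's languages): flatten each character image into a row of pixel features and fit a Random Forest. The shapes and the random stand-in data are assumptions, not the actual competition data.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    # Stand-in data: 200 flattened 20x20 grayscale images and 62
    # character classes (a-z, A-Z, 0-9), roughly matching the
    # competition's resized tutorial images.
    rng = np.random.default_rng(0)
    X = rng.random((200, 400))
    y = rng.integers(0, 62, size=200)

    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X, y)           # training searches for a predictive function
    print(clf.score(X, y))  # accuracy on the training set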


But boosting the parameters of our Random Forest algorithm (such as the number of trees) didn't drastically improve our score. That's when we found out about the Mocha package.

Convolutional Neural Networks 



The recent resurgence in popularity of Neural Networks is due to the amazing performance of Convolutional Neural Networks (CNNs) at image classification. This was exactly the solution our problem needed. Dung Thai is very knowledgeable about Deep Learning, and encouraged us to try out the Mocha package for Deep Learning in Julia. As a result we quickly moved into the top 20 on the leaderboard.




Pulling out all the Stops

At our next meeting Dung (pronounced Yung) summarized "Learn from the Best", and we talked about how to get to the next level. Data Science Saigon has talent across a variety of platforms and languages, including C++, Caffe, Scilab, Python, and of course Julia. We also noticed a few things about the rules for this particular competition:

  1. Outside Data is not forbidden
  2. Semi-supervised learning is not forbidden
  3. The language does not have to be Julia  
We pondered how a Convolutional Network from a good Python library like Theano would perform. We also gathered many more training images from the Chars74k dataset and the Street View House Numbers dataset.
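For illustration, here is a hedged sketch of a single convolutional layer in Theano, the Python library mentioned above. The shapes, names, and filter count are our assumptions; the team's actual network was of course deeper.

    import numpy as np
    import theano
    import theano.tensor as T

    X = T.tensor4("X")  # (batch, channels, height, width)
    W = theano.shared(
        np.random.randn(8, 1, 5, 5).astype(theano.config.floatX),
        name="W")       # 8 filters of size 5x5

    conv_out = T.nnet.relu(T.nnet.conv2d(X, W))  # convolution + nonlinearity
    layer = theano.function([X], conv_out)

    images = np.random.randn(4, 1, 20, 20).astype(theano.config.floatX)
    print(layer(images).shape)  # -> (4, 8, 16, 16) with 'valid' convolution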

Saigon là số một (Saigon is number one)

Then last week Dung Thai, Vinh Vu, and Nguyen Quy checked in Python code using Theano that recognizes over 92% of the images correctly, vaulting us into the #1 spot on the Getting Started with Julia leaderboard. Congratulations to everyone taking part in Data Science Saigon.

Our Remaining Challenge

So clearly, training with lots more data improved the score. But the question remains: would a CNN in Julia with the additional training data achieve a score similar to the Python code? We hope to find out when we meet again. All of our code is here.






Thursday, May 14, 2015

Python vs Julia


I really enjoyed this Python or Julia comparison from these Quantitative Economists. They give insightful advantages and disadvantages for both languages. I dug up the site because I started to wonder if I'm crazy to be learning Julia at the same time that I'm working with, and still learning, Python. The final statement on their page eased my mind:

Still Can’t Decide?

Learn both — you won’t regret it

Sunday, May 3, 2015

Putting the Train in AppTrain



In late 2005, when first learning Ruby and Rails, I founded the AppTrain project, a web interface to the early Rails generators. The "Train" represented a vehicle rolling along on top of Rails. As Rails grew in popularity, we began helping build Rails teams, and the train took on a new meaning: we were training developers in Rails and related technologies.





And now, years later, still excited about the future of technology, we're doing plenty of data science programming and machine learning. With machine learning, specifically supervised learning, it's important to build a good training set of data. A training set represents a relationship between a result and the list of data points that correspond with that result. The result is also referred to as a signal.


In supervised learning, different algorithms can be trained on these training sets. When an algorithm is being trained, it is looking for a function that best explains the signal.


Imagine a small data set like this:
[x, y]
[4, 2]
[6, 3]
[2, 1]


The first number on each line is our result, or signal.  The second is the input data that leads to that signal.  Do you see a function that could predict the value of the signal x given a new value for y?


[x, 4]


Visually, we immediately see that x is always greater than y (x > y). Our minds are searching for a function that will predict x. How much greater is x than y? You'll notice pretty quickly that it's exactly double.

x = 2y
x = 2(4)
x = 8


x represents the signal. y is the training data.  


Data scientists have developed many algorithms that can run through numbers similar to the way our minds do, but much faster. Imagine a training set with not three rows, but 100 or 1,000. It would be pretty boring to read through them all to make sure x was always double y, but it's a great job for a computer.
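As a quick illustration (ours, not from the original post), here's how a computer can learn the x = 2y rule with scikit-learn's LinearRegression:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    y_inputs = np.array([[2], [3], [1]])  # the input data
    x_signal = np.array([4, 6, 2])        # the signal we want to predict

    model = LinearRegression()
    model.fit(y_inputs, x_signal)  # training: search for the best function

    print(model.coef_)             # -> [2.], the learned "x = 2y"
    print(model.predict([[4]]))    # -> [8.]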


Complicated Data Sets


Now imagine the training set has not just 2 variables (columns), x and y, but 10, or 100.


Here's data from a training set in the Restaurant Revenue Prediction competition at Kaggle.





In this case the result (or signal) we're trying to predict is the last column in each row: the revenue.


The Python programming language is a favorite of data science programmers, and scikit-learn is its go-to machine learning library. It contains learning algorithms designed to be trained on data sets like this restaurant revenue data. Each algorithm in scikit-learn looks for functions that predict the signals found in training data.

To solve Kaggle problems like the restaurant revenue problem, competitors typically first try one of the single models found in scikit-learn. On the discussion board for the competition, people mention using Support Vector Machines (SVM), Gradient Boosting (GBM), and Random Forest. But competition winners ultimately blend techniques, or even devise their own algorithms.
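Here's a hedged sketch of that typical first attempt; the file name and column handling are our assumptions, not details from the competition:

    import pandas as pd
    from sklearn.ensemble import GradientBoostingRegressor

    train = pd.read_csv("train.csv")  # assumed Kaggle training file
    X = train.select_dtypes("number").drop("revenue", axis=1)
    y = train["revenue"]              # the signal: the last column

    model = GradientBoostingRegressor(n_estimators=100)
    model.fit(X, y)
    print(model.predict(X.head()))    # predicted revenue, first 5 rows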

Meanwhile, the train cruises forward at AppTrain. Today we're building training sets and training algorithms. Want to learn more about machine learning? Attend our tech talks at Làm việc mạng in Saigon this summer.


Tuesday, April 21, 2015

Efficient Computing with NumPy - Jake Vanderplas by PyData

Jake Vanderplas is an NSF post-doctoral fellow at the University of Washington, working jointly between the Computer Science and Astronomy departments. His research involves applying recent advances in machine learning to large astronomical datasets, in order to learn about the Universe at the largest scales. He is co-author of "Statistics, Data Mining, and Machine Learning in Astronomy", a Python-centric textbook published by Princeton University Press, and has presented many technical talks and papers in this subject area. In the Python world, Jake is active in maintaining and contributing to several core Python scientific computing packages, including scikit-learn, SciPy, Matplotlib, and others. He occasionally blogs on Python-related topics at http://ift.tt/1nhB2qI.

What is PyData? PyData.org is the home for all things related to the use of Python in data management and analysis. The site aims to make open source data science tools easily accessible by listing the links in one location. If you would like to submit a download link or any items to be listed in PyData News, please let us know at: admin@pydata.org
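The central theme of the talk is replacing explicit Python loops with vectorized NumPy operations. A tiny illustration (ours, not from the talk):

    import numpy as np

    x = np.arange(1_000_000, dtype=np.float64)

    # Slow: an explicit Python loop
    total = 0.0
    for value in x:
        total += value * value

    # Fast: one vectorized expression evaluated in compiled code
    vectorized = np.dot(x, x)

    assert np.isclose(total, vectorized)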
