Showing posts with label Tech Talk Tuesday. Show all posts

Thursday, July 23, 2015

How Data Science Saigon took the lead in a Kaggle Competition

Last month, at Tech Talk Tuesday, we formed a team for the Kaggle competition Getting Started with Julia. Last week, our team Data Science Saigon took the number one spot on the leaderboard. Here's how it happened.

Entering a Kaggle competition

You've got to be in it to win it. When our team met on June 16th, we created accounts on Kaggle's site and on Bitbucket by Atlassian, a company with offices here in Saigon. We reviewed DataFrames with Julia, and got the code from the Julia Tutorial and the K-Nearest-Neighbor tutorial working with help from the Kaggle forums. In particular, some method calls have changed since the tutorials were created, but we found the workaround in the Convert method error when reading image files topic. Our versions work as of Julia 0.3.7. Implementing the tutorial code got us to about 46th place at the time.
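We won't reproduce the forum thread's exact fix here, but the gist of that kind of workaround is to convert pixel data explicitly instead of relying on implicit conversion. Here's a sketch on a stand-in array (no real image-reading call, since that API is exactly what changed between releases):

```julia
# Stand-in for pixel data read from an image file; a real pipeline
# would get this from an image-loading package.
img = reshape(collect(0:255), 16, 16)

# Convert explicitly to Float32 and normalize to [0, 1], rather than
# relying on an implicit conversion that may no longer exist.
features = convert(Array{Float32}, img) ./ 255f0
```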

Machine Learning with Julia

At our next meeting we took a look at how to build a predictive model based on the Random Forest algorithm.

But tuning the parameters of our Random Forest model didn't drastically improve our score. This is when we found out about the Mocha package.
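For the curious, the idea behind a random forest can be sketched in a few lines of Base Julia: train many weak learners on bootstrap resamples of the data and let them vote. (A real forest uses full decision trees and a proper package; the single-threshold "stumps" below are a deliberately tiny stand-in, and the data is made up.)

```julia
using Random
Random.seed!(1)   # make the bootstrap samples reproducible

# Fit a decision stump: the threshold that best separates
# class 0 (left of threshold) from class 1 (right).
function fit_stump(x::Vector{Float64}, y::Vector{Int})
    best_t, best_err = x[1], length(y) + 1
    for t in x
        err = count(i -> (x[i] <= t ? 0 : 1) != y[i], eachindex(y))
        if err < best_err
            best_t, best_err = t, err
        end
    end
    return best_t
end

predict_stump(t, v) = v <= t ? 0 : 1

# "Forest": many stumps, each trained on a bootstrap resample.
function fit_forest(x, y, n_trees)
    [fit_stump(x[idx], y[idx]) for idx in
        (rand(1:length(y), length(y)) for _ in 1:n_trees)]
end

# Majority vote over the stumps.
predict_forest(stumps, v) =
    count(t -> predict_stump(t, v) == 1, stumps) * 2 > length(stumps) ? 1 : 0

x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
y = [0, 0, 0, 1, 1, 1]
forest = fit_forest(x, y, 25)
predict_forest(forest, 0.5)    # well below the boundary: class 0
predict_forest(forest, 12.5)   # well above the boundary: class 1
```

The parameters we were tuning on the real model play the same roles as `n_trees` and the depth of each learner here.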

Convolutional Neural Networks 

The recent resurgence in popularity of neural networks is due to the amazing performance of Convolutional Neural Networks (CNNs) at image classification. This was exactly the solution our problem needed. Dung Thai is very knowledgeable about Deep Learning, and encouraged us to try out the Mocha package for Deep Learning in Julia. As a result we quickly moved into the top 20 on the leaderboard.

Pulling out all the Stops

At our next meeting Dung (pronounced Yung) summarized Learn from the Best, and we talked about how to get to the next level. Data Science Saigon has talent across a variety of platforms and languages including C++, Caffe, Scilab, Python and, of course, Julia. We also noticed a few things about the rules for this particular competition:

  1. Outside data is not forbidden
  2. Semi-supervised learning is not forbidden
  3. The language does not have to be Julia

We pondered how a convolutional network from a good Python library like Theano would perform. We also gathered lots more training images from the Chars74k dataset and the Street View House Numbers dataset.

Saigon là số một (Saigon is number one)

Then last week Dung Thai, Vinh Vu, and Nguyen Quy checked in Python code using Theano that recognizes over 92% of the images correctly, vaulting us into the #1 spot on the Getting Started with Julia leaderboard. Congratulations to everyone taking part in Data Science Saigon.

Our Remaining Challenge

So clearly, training with lots more data improved the score. But the question remains: would a CNN in Julia, given the same additional training data, score as well as the Python code? We hope to find out when we meet again. All of our code is here.

Monday, June 29, 2015

Technical English Seminar

It's super exciting to be a part of tonight's Technical English Seminar at VTC Academy. This started off as another Tech Talk Tuesday presentation, but thanks to the support of VTC Academy, we've got quite a crowd coming tonight. Here are the slides for the first part:

Tuesday, June 9, 2015

Data Frames with Julia

Today's Tech Talk Tuesday is virtual; we'll do a live one next week.
Learn how to code with R-like DataFrames in Julia, and see Julia's amazing vectorized assignment operator work on a DataArray.

DataFrames with Julia from AppTrain on Vimeo.

We read a CSV file into a DataFrame, then learn how to subset it and update values in it.
Code is at
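The DataFrames API has moved on since Julia 0.3, so rather than the exact DataArray calls from the video, here is the same vectorized-assignment idea on a plain Julia array, in modern (1.x) syntax:

```julia
ages = [15, 22, 31, 8, 45]

# One statement updates every matching element: select rows with a
# broadcast comparison, assign with broadcast `=`.
ages[ages .< 18] .= 18
```

In the video the same pattern runs against a DataArray column; only the spelling of the broadcast has changed.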

Thursday, June 4, 2015

Tech Talk Tuesday: Reading and Writing Files with Julia

I'm planning a series of these short videos on Julia basics. 

Reading and Writing Files with Julia from AppTrain on Vimeo.

Actually this one is over 7 minutes. I'd like to get them down to under 5, but I'm still getting the hang of this :). Anyway, thanks for coming to the talks, and keep coming; we'll build on the basics covered in the videos.
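The basics from the video condense into a few lines of Base Julia (the file name is just a scratch file for the demo):

```julia
path = tempname()                 # temporary file for the demo

# Writing: open with "w"; the do-block closes the file automatically.
open(path, "w") do io
    println(io, "xin chao")
    println(io, "hello")
end

# Reading: readlines returns the lines without their trailing newlines.
lines = readlines(path)
```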

Monday, June 1, 2015

Tech Talk Tuesday: Starting with Julia

People asked about dialing into a Tech Talk. Here's the next best thing: Tech Talk screencasts. A little unpolished, but here you go:

Starting with Julia from AppTrain Technology on Vimeo.

See how to type your first few lines of Julia code. You'll also learn some key concepts behind the Julia language: Optional Static Typing, multiple dispatch, vectorized operations, and find out what a homoiconic language is. Enjoy!

Saturday, May 16, 2015

Webworking - Làm việc mạng

Webworking (Làm việc mạng) is a group organized to encourage people in Ho Chi Minh City with careers in web technologies, and to support those whose careers need a boost or a change.

Our most popular event, Wednesday morning coffee is an informal OpenCoffee where ambitious young entrepreneurs, developers and investors mingle and learn from each other. We talk about the latest technologies, favorite Saigon coffee shops and how to build careers and businesses in the promising Vietnamese economy.

Now you can participate from afar by sponsoring a coffee session. $25 earns you the gratitude of that week's attendees and gets your company or organization on the event page.

Sponsor Open Coffee

Thanks for the help building meaningful careers in Saigon!

Sunday, May 3, 2015

Putting the Train in AppTrain

In late 2005, when I was first learning Ruby and Rails, I founded the AppTrain project, a web interface to the early Rails generators. The Train represented a vehicle rolling along on top of Rails. As Rails grew in popularity, we began helping build Rails teams, and the train took on a new meaning: we were training developers in Rails and related technologies.

And now, years later, still excited about the future of technology, we're doing plenty of data science programming and machine learning. With machine learning, specifically supervised learning, it's important to build a good training set of data. Training sets represent a relationship between a result and a list of data points that correspond with that result. The result is also referred to as a signal.

In supervised learning different algorithms can be trained on these training sets.  When an algorithm is being trained, it is looking for a function that best explains a signal.

Imagine a small data set like this:

The first number on each line is our result, or signal.  The second is the input data that leads to that signal.  Do you see a function that could predict the value of the signal x given a new value for y?


Visually, we immediately see that x is always greater than y (x > y). Our minds search for a function that will predict x. How much greater is x than y? You'll notice pretty quickly that it's exactly double.

x = 2y
For a new value y = 4, x = 8.

x represents the signal. y is the training data.  
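That search can be written down. The three rows described above aren't reproduced here, so the pairs below are made up to satisfy the same relationship; a one-parameter least-squares fit recovers the doubling function:

```julia
ys = [1.0, 3.0, 4.0]    # input data (made-up values)
xs = [2.0, 6.0, 8.0]    # signal: each x is double its y

# Least-squares estimate of w in the model x = w*y
w = sum(xs .* ys) / sum(ys .^ 2)

w * 4.0    # predict x for a new y = 4: gives 8.0
```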

Data scientists have developed many algorithms that can run through numbers much the way our minds do, but far faster. Imagine a training set not with three rows, but with 100 or 1,000. It would be pretty boring to read through them all to make sure x was always double y, but it's a great job for a computer.

Complicated Data Sets

Now imagine the training set has not just 2 variables (columns), x and y, but 10, or 100.

Here's data from a training set in the Restaurant Revenue prediction competition at Kaggle.  

In this case the result (or signal) we're trying to predict is the last column in each row, the revenue.

The Python programming language is a favorite of data science programmers, and scikit-learn is its machine learning library. It contains learning algorithms designed to be trained on data sets like this restaurant revenue data. Each algorithm in scikit-learn looks for functions that predict the signals found in training data.

To solve Kaggle problems like the restaurant revenue problem, competitors typically first try one of the single models found in scikit-learn. On the discussion board for the competition, people mention using Support Vector Machines (SVM), Gradient Boosting (GBM), and Random Forest. But competition winners ultimately blend techniques, or even devise their own algorithms.
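The same "find a function that predicts the signal" step generalizes to many columns. Here's a sketch in Julia with made-up numbers standing in for the real restaurant columns (a scikit-learn `fit` call plays the equivalent role in Python):

```julia
# Each row is a restaurant; the first column is an intercept term,
# the second a made-up feature. Values are illustrative only.
X = [1.0 2.0
     1.0 3.0
     1.0 5.0
     1.0 7.0]
revenue = [5.0, 7.0, 11.0, 15.0]   # the signal column

# The backslash operator solves the least-squares problem,
# returning one coefficient per column of X.
coeffs = X \ revenue
```

With 100 columns nothing changes except the width of X, which is exactly why this is a job for a computer.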

Meanwhile, the train cruises forward at AppTrain.  Today we're building training sets, and training algorithms. Want to learn more about machine learning?  Attend our tech talks at Làm việc mạng in Saigon this summer.

Tuesday, April 28, 2015

Modern Customer Relationship Management

At April's Tech Talk Tuesday, we previewed six Modern Customer Relationship Management (CRM) Systems.

There are hundreds of CRM systems available. To narrow the field to six, we looked for modern tools that meet important criteria:

  • Web based 
  • Customizable 
  • Documented APIs 
  • Social Media Integration 
The six we experimented with are all excellent choices for managing customer databases. When it comes to choosing one, the answer lies in choosing the right tool for the right business.


Highrise, from the makers of Basecamp, is a simple tool that works great for any small to medium size business. It integrates well with Basecamp, so for a typical web development shop that already uses Basecamp, this is a natural choice.


In the presentation above, Salesforce is referred to as the 800-pound gorilla of Cloud CRM. Salesforce is widely implemented, and even has its own programming language for customization (Apex).


Zoho is the value play in choosing a CRM, giving you a feature-rich CRM system at affordable pricing. Zoho implementations are increasing faster than any other solution listed here. They're the rising star in cloud solutions.


SugarCRM is ideal for clients looking to host their own CRM system, but there's also a hosted solution available. The product uses what they call a "commercial open source" license.


Insightly started off as a Google app, and as a result integrates well with things like Google Calendar. Clients familiar with Google's ecosystem should look closely at Insightly.


Nimble is a newcomer, and has a more modern feel than other CRMs listed here. Nimble has some impressive social media integration features, allowing users to easily associate customers with online profiles. A cutting-edge startup would play well with Nimble.

It's been a while since we created The App Train infographic, but these are the CRM systems we are looking at now. They're just a few of the many available. What are your favorite CRM tools?

Thursday, April 16, 2015

Single Sign-On: Managing Authentication with Google, Twitter and Facebook accounts.

Last month at Tech Talk Tuesday we talked about setting up single sign-on for websites. Today's web solutions are based on a standard called OAuth2.

It's not difficult to allow multiple login options on a site. The JavaScript library Passport works great for Node.js applications, the OAuth gem does the job for Ruby on Rails applications, and the WP-OAuth plugin is great for WordPress sites.

What really impressed us was how easy it was to install and configure OAuth for Meteor applications. Meteor is an impressive framework for setting up Node.js applications quickly. We'll be exploring it more soon.

Thursday, March 19, 2015

Introduction to Julia

"Julia is a fresh approach to technical computing."  boasts the startup message, flourished with colorful circles hovering above a bubbly ASCII Julia logo.  The formatting effort is not wasted, it's an exuberant promise: Julia will make the command line fun again.
apptrain_1@julia:~/workspace $ julia
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation:
   _ _   _| |_  __ _   |  Type "help()" to list help topics
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.2.1 (2014-02-11 06:30 UTC)
 _/ |\__'_|_|_|\__'_|  |  
|__/                   |  x86_64-linux-gnu


Julia was created by four data scientists from MIT who began working on it in 2009. The language is beginning to mature at a time when the Data Scientist job title is popping up on résumés as fast as data scientist jobs appear. The timing is excellent. R, an offshoot of the S language, is the language of choice for today's mathematical programmer, but it feels clunky, like a car from the last century. While Julia may not unseat R in the world of data analysis, its plans don't stop there.

If you want to code along with the examples in this article, jump to Getting Started with Julia and choose one of the three options to start coding.

Julia is a general purpose programming language. Its creators have noble goals: they want a language that is fast like C, flexible with cool metaprogramming capabilities like Ruby, capable of parallel and distributed computing like Scala, and able to express true mathematical equations like MATLAB.

Why program in Julia?

1) Julia is Fast

Julia already boasts faster matrix multiplication and sorting than Go and Java. It is built on the LLVM compiler infrastructure, which languages like Go also use for fast compilation. Julia compiles just in time (JIT) to machine code, and often achieves C-like performance numbers.
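A quick, unscientific way to watch the JIT at work (timings vary by machine, so no numbers are claimed): the first call to a function includes compilation, and later calls run the already-compiled code.

```julia
f(v) = sum(x -> x * x, v)   # a small numeric kernel
v = rand(10^6)

first_call  = @elapsed f(v)   # includes JIT compilation of f
second_call = @elapsed f(v)   # runs the compiled code only
```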

2) Julia is written in Julia

Contributors need only work with a single language, which makes it easier for Julia users to become core contributors. 
"As a policy, we try to never resort to implementing things in C. This keeps us honest – we have to make Julia fast enough to allow us to do that" -Stephan Karpinski

And, as the language's co-creator Karpinski notes in the comments of the referenced post, writing the language itself in Julia means that when improvements are made to the compiler, both system code and user code get faster.

3) Julia is Powerful

Like most programming languages, its implementation is open source. Anyone can work on the language or the documentation. And like most modern programming languages, Julia has extensive metaprogramming support. Its creators credit the Lisp language as their inspiration:
Like Lisp, Julia represents its own code as a data structure of the language itself.
a) Optional Strong Typing
Type annotations can help the compiler generate faster code, but Julia keeps them optional, which frees up programmers who want to write dynamic routines that work on multiple types.
julia> @code_typed(sort(v))
1-element Array{Any,1}:
 :($(Expr(:lambda, {:v}, {{symbol("#s1939"),symbol("#s1924")},{{:v,Array{Float64,1},0},{symbol("#s1939"),Array{Any,1},18},{symbol("#s1924"),Array{Any,1},18}},{}}, :(begin $(Expr(:line, 358, symbol("sort.jl"), symbol("")))
        #s1939 = (top(ccall))(:jl_alloc_array_1d,$(Expr(:call1, :(top(apply_type)), :Array, Any, 1))::Type{Array{Any,1}},$(Expr(:call1, :(top(tuple)), :Any, :Int))::(Type{Any},Type{Int64}),Array{Any,1},0,0,0)::Array{Any,1}
        #s1924 = #s1939::Array{Any,1}
        return __sort#77__(#s1924::Array{Any,1},v::Array{Float64,1})::Array{Float64,1}
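
A tiny illustration of that optionality, with made-up function names: annotate when you want the restriction, leave the annotation off when you don't.

```julia
double(x) = x + x            # untyped: works for Int, Float64, matrices, ...
double_int(x::Int) = x + x   # annotated: restricted to Int

double(1.5)        # 3.0
double_int(21)     # 42
# double_int(1.5) would raise a MethodError -- the annotation is enforced
```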

b) Introspective

Julia's introspection is awesome, particularly if you enjoy looking at native assembler code. Dissecting assembler code comes in handy when optimizing algorithms, and Julia programmers have several introspection functions for the job. Here the code_native function shows the native code generated for sorting an array of integers.
julia> code_native(sort,(Array{Int,1},))
        push    RBP
        mov     RBP, RSP
        push    R14
        push    RBX
        sub     RSP, 48
        mov     QWORD PTR [RBP - 56], 6
        movabs  R14, 139889005508848
        mov     RAX, QWORD PTR [R14]
        mov     QWORD PTR [RBP - 48], RAX
        lea     RAX, QWORD PTR [RBP - 56]
        mov     QWORD PTR [R14], RAX
        xorps   XMM0, XMM0
        movups  XMMWORD PTR [RBP - 40], XMM0
        mov     QWORD PTR [RBP - 24], 0
        mov     RBX, QWORD PTR [RSI]
        movabs  RAX, 139888990457040
        mov     QWORD PTR [RBP - 32], 28524096
        mov     EDI, 28524096
        xor     ESI, ESI
        call    RAX
        lea     RSI, QWORD PTR [RBP - 32]
        movabs  RCX, 139889006084144
        mov     QWORD PTR [RBP - 40], RAX
        mov     QWORD PTR [RBP - 32], RAX
        mov     QWORD PTR [RBP - 24], RBX
        mov     EDI, 128390064
        mov     EDX, 2
        call    RCX
        mov     RCX, QWORD PTR [RBP - 48]
        mov     QWORD PTR [R14], RCX
        add     RSP, 48
        pop     RBX
        pop     R14
        pop     RBP

c) Multiple Dispatch

Multiple dispatch enables object-oriented behavior. Each function can have several methods, each designed to operate on particular types of parameters, and the appropriate method is dispatched at runtime based on the parameter types.

julia> methods(sort)
# 4 methods for generic function "sort":
sort(r::UnitRange{T<:Real}) at range.jl:533
sort(r::Range{T<:Real}) at range.jl:536
sort(v::AbstractArray{T,1}) at sort.jl:358
sort(v::AbstractArray{T},dim::Integer) at sort.jl:368
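You can see the same mechanism with a tiny dispatch example of our own (the describe function is made up):

```julia
describe(x::Int)         = "an integer"
describe(x::Float64)     = "a float"
describe(x::Int, y::Int) = "two integers"

describe(42)      # "an integer"
describe(1, 2)    # "two integers"
length(methods(describe))   # 3 -- one method per signature
```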
