Thursday, March 19, 2015

Introduction to Julia

"Julia is a fresh approach to technical computing."  boasts the startup message, flourished with colorful circles hovering above a bubbly ASCII Julia logo.  The formatting effort is not wasted, it's an exuberant promise: Julia will make the command line fun again.
apptrain_1@julia:~/workspace $ julia
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "help()" to list help topics
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.2.1 (2014-02-11 06:30 UTC)
 _/ |\__'_|_|_|\__'_|  |  
|__/                   |  x86_64-linux-gnu

julia> 

Julia was created by four Data Scientists from MIT who began working on it around 2011.  The language is beginning to mature at a time when the Data Scientist job title is popping up on resumes as fast as Data Scientist jobs appear.  The timing is excellent.   R programming, an offshoot of S Programming , is the language of choice for today's mathematical programmer.  But it feels clunky, like a car from the last century. While Julia may not unseat R in the world of Data Analysis,  plans don't stop there.

If you want to code along with the examples in this article, jump to Getting Started with Julia and chose one of the three options to start coding.

Julia is a general purpose programming language.  It's creators have noble goals.  They want a language that is fast like C, they want it flexible with cool metaprograming capabilities like Ruby, they want parallel and distributed computing like Scala, and true Mathematical equations like MATLAB.



Why program in Julia?

1) Julia is Fast


Julia already boasts faster matrix multiplication and sorting than Go and Java.  It uses the LLVM compiler, which languages like GO use for fast compilation.   Julia uses just in time (JIT) compilation to machine code , and often achieves C like performance numbers.

2) Julia is written in Julia

Contributors need only work with a single language, which makes it easier for Julia users to become core contributors. 
"As a policy, we try to never resort to implementing things in C. This keeps us honest – we have to make Julia fast enough to allow us to do that" -Stephan Karpinski

And, as the languages co-creator Karpinski notes in the comments of the referenced post,   Writing the language itself in Julia means that when improvements are made to the compiler, both the system and user code gets faster.

3) Julia is Powerful

Like most programming languages, it's implementation is Open Source.  Anyone can work on the language or the documentation.  And like most modern programming languages, Julia has extensive metaprogramming support.  It's creators attribute the Lisp language for their inspiration:
Like Lisp, Julia represents its own code as a data structure of the language itself.
a) Optional Strong Typing
Using strong typing can speed up compiling, but Julia keeps strong typing optional, which frees up programmers who want to write dynamic routines that work on multiple types. 
julia> @code_typed(sort(arry))
1-element Array{Any,1}:
 :($(Expr(:lambda, {:v}, {{symbol("#s1939"),symbol("#s1924")},{{:v,Array{Float64,1},0},{symbol("#s1939"),Array{Any,1},18},{symbol("#s1924"),Array{Any,1},18}},{}}, :(begin $(Expr(:line, 358, symbol("sort.jl"), symbol("")))
        #s1939 = (top(ccall))(:jl_alloc_array_1d,$(Expr(:call1, :(top(apply_type)), :Array, Any, 1))::Type{Array{Any,1}},$(Expr(:call1, :(top(tuple)), :Any, :Int))::(Type{Any},Type{Int64}),Array{Any,1},0,0,0)::Array{Any,1}
        #s1924 = #s1939::Array{Any,1}
        return __sort#77__(#s1924::Array{Any,1},v::Array{Float64,1})::Array{Float64,1}
    end::Array{Float64,1}))))

b) Introspective


Julia's introspection is awesome, particularly if you enjoy looking at native assembler code. Dissecting assembler code comes in handy when optimizing algorithms. Julia programmers have several introspection functions for optimization. Here the code_native method shows the recursive nature of a binary sort algorithm.
julia> code_native(sort,(Array{Int,1},))
        .text
        push    RBP
        mov     RBP, RSP
        push    R14
        push    RBX
        sub     RSP, 48
        mov     QWORD PTR [RBP - 56], 6
        movabs  R14, 139889005508848
        mov     RAX, QWORD PTR [R14]
        mov     QWORD PTR [RBP - 48], RAX
        lea     RAX, QWORD PTR [RBP - 56]
        mov     QWORD PTR [R14], RAX
        xorps   XMM0, XMM0
        movups  XMMWORD PTR [RBP - 40], XMM0
        mov     QWORD PTR [RBP - 24], 0
        mov     RBX, QWORD PTR [RSI]
        movabs  RAX, 139888990457040
        mov     QWORD PTR [RBP - 32], 28524096
        mov     EDI, 28524096
        xor     ESI, ESI
        call    RAX
        lea     RSI, QWORD PTR [RBP - 32]
        movabs  RCX, 139889006084144
        mov     QWORD PTR [RBP - 40], RAX
        mov     QWORD PTR [RBP - 32], RAX
        mov     QWORD PTR [RBP - 24], RBX
        mov     EDI, 128390064
        mov     EDX, 2
        call    RCX
        mov     RCX, QWORD PTR [RBP - 48]
        mov     QWORD PTR [R14], RCX
        add     RSP, 48
        pop     RBX
        pop     R14
        pop     RBP
        ret

c) Multiple Dispatch

Multiple dispatch allows Object Oriented behavior.  Each function can have several  methods designed to operate on the types of the method parameters. The appropriate method is dispatched at runtime based on the parameter types.

julia> methods(sort)
# 4 methods for generic function "sort":
sort(r::UnitRange{T<:real at="" bstractarray="" dim::integer="" pre="" r::range="" range.jl:533="" range.jl:536="" sort.jl:358="" sort.jl:368="" sort="" v::abstractarray="">


4) Julia is great for Data Science

a) Readable Mathematics
Mathematical expressions are written the way a Mathematician would write them.  
julia> 3*2/12
0.5
julia> x = 1
1
julia> x += 3
4
julia> x
4
julia> 1 != 2
true
julia> 1 == 1.0
true
julia> 1 < 2
true

There's also the standard mathematical functions you'd expect, and plenty more:
julia> sqrt(45)
6.708203932499369

julia> exp(4)
54.598150033144236

julia> e^4
54.598150033144236

julia> sin(90) + cos(90)
0.44592304747138767

julia> log(90)
4.499809670330265

julia> log10(90)
1.954242509439325 
b) Matrix Magic
A matrix in Julia is represented as a 2 dimensional Array.  Note the type, {Float64, 2} of this 10x3 array of random numbers.

julia> rand(10,3)
10x3 Array{Float64,2}:
 0.348709   0.419119    0.871724  
 0.789848   0.00412957  0.0574721 
 0.73065    0.0367485   0.374954  
 0.0310078  0.363313    0.0296657 
 0.871963   0.496541    0.954185  
 0.169939   0.259749    0.642488  
 0.863924   0.263358    0.11318   
 0.691642   0.510957    0.00921175
 0.162715   0.127321    0.953208  
 0.673151   0.53509     0.0573474 

Julia contains many ways of quickly generating and concating Arrays.  The randn()  function generates numbers along a normal distribution.  The eye() function generates an Identity matrix. Putting both statements in brackets [ ] concatenates them horizontally.
julia> [randn(10,10)  eye(10)]
10x20 Array{Float64,2}:
 -1.2901    -0.0419352  -1.0193     -0.624864    1.42954   -0.774478   …   0.187148  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 -0.817845  -0.226443    0.197861   -0.773366    0.255801   1.14292       -1.1255    0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
  0.211343  -1.22742     0.560432   -0.06335     0.154775  -0.633967      -0.925349  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 -1.77692   -2.97158     0.79589    -0.0582595  -0.655429  -2.69089       -0.82891   0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0
  1.06179   -0.725923    1.88578     1.32889    -1.24385    1.48486        0.56051   0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0
  0.136371  -1.12339     0.991863    1.01313     1.28505    0.0463603  …   0.767006  0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0
  0.762102   0.572324   -0.343285    0.870918    1.37059    0.946939       0.332733  0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0
  0.569224   1.57083    -0.0563069   0.985498    1.42752   -0.0975067      1.30598   0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0
  0.235689  -0.3421      1.68326    -0.38728     0.211276  -0.588577      -1.26163   0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0
  0.221076  -1.41407    -0.129416   -0.341076    0.514588  -0.591832      -1.3531    0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0

Using a semicolon to separate the functions instead of a space gives a vertical concat:
julia> [ones(10,10); eye(10)]
20x10 Array{Float64,2}:
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0

One powerful way of generating an Array is Comprehensions. See if you can comprehend a Comprehension from this code:
julia> [x^y  for x in 1:5, y in rand(5)]
5x5 Array{Float64,2}:
 1.0      1.0      1.0      1.0      1.0    
 1.53691  1.20257  1.56169  1.45627  1.25234
 1.9762   1.33959  2.02693  1.81439  1.42853
 2.36209  1.44617  2.43887  2.12071  1.56837
 2.71259  1.53465  2.81522  2.3935   1.68619


c) Data Frames



Data Frames are a fundamental part of any Data Science language  In Julia, they behave similar to the way they work in R Programming.  Speaking of R, the RDatasets package makes many datasets available from R.

julia> Pkg.add("RDatasets")
INFO: Cloning cache of ArrayViews from git://github.com/JuliaLang/ArrayViews.jl.git
INFO: Cloning cache of DataArrays from git://github.com/JuliaStats/DataArrays.jl.git
INFO: Cloning cache of DataFrames from git://github.com/JuliaStats/DataFrames.jl.git
INFO: Cloning cache of GZip from git://github.com/JuliaLang/GZip.jl.git
INFO: Cloning cache of RDatasets from git://github.com/johnmyleswhite/RDatasets.jl.git




Notice loading RDatasets also loaded the DataFrames package, which we needed anyway. The RDatasets.packages() method lists all available packages.

Here's a look at an airquality database in the datasets package:
using RDatasets
airq = dataset("datasets", "airquality")
153x6 DataFrame
| Row | Ozone | Solar_R | Wind | Temp | Month | Day |
|-----|-------|---------|------|------|-------|-----|
| 1   | 41    | 190     | 7.4  | 67   | 5     | 1   |
| 2   | 36    | 118     | 8.0  | 72   | 5     | 2   |
| 3   | 12    | 149     | 12.6 | 74   | 5     | 3   |
| 4   | 18    | 313     | 11.5 | 62   | 5     | 4   |
| 5   | NA    | NA      | 14.3 | 56   | 5     | 5   |
| 6   | 28    | NA      | 14.9 | 66   | 5     | 6   |
| 7   | 23    | 299     | 8.6  | 65   | 5     | 7   |
| 8   | 19    | 99      | 13.8 | 59   | 5     | 8   |
| 9   | 8     | 19      | 20.1 | 61   | 5     | 9   |
| 10  | NA    | 194     | 8.6  | 69   | 5     | 10  |
| 11  | 7     | NA      | 6.9  | 74   | 5     | 11  |
| 12  | 16    | 256     | 9.7  | 69   | 5     | 12  |
⋮
| 141 | 13    | 27      | 10.3 | 76   | 9     | 18  |
| 142 | 24    | 238     | 10.3 | 68   | 9     | 19  |
| 143 | 16    | 201     | 8.0  | 82   | 9     | 20  |
| 144 | 13    | 238     | 12.6 | 64   | 9     | 21  |
| 145 | 23    | 14      | 9.2  | 71   | 9     | 22  |
| 146 | 36    | 139     | 10.3 | 81   | 9     | 23  |
| 147 | 7     | 49      | 10.3 | 69   | 9     | 24  |
| 148 | 14    | 20      | 16.6 | 63   | 9     | 25  |
| 149 | 30    | 193     | 6.9  | 70   | 9     | 26  |
| 150 | NA    | 145     | 13.2 | 77   | 9     | 27  |
| 151 | 14    | 191     | 14.3 | 75   | 9     | 28  |
| 152 | 18    | 131     | 8.0  | 76   | 9     | 29  |
| 153 | 20    | 223     | 11.5 | 68   | 9     | 30  |



DataFrames can be subset by filtering both columns and rows. The any line gets all rows that have NA values (isna).  The tilda (~) removes those rows from the dataframe. 

julia> # Get rid of NAs

julia> newair = airq[~any(colwise(isna, airq),), :]
111x6 DataFrame
| Row | Ozone | Solar_R | Wind | Temp | Month | Day |
|-----|-------|---------|------|------|-------|-----|
| 1   | 41    | 190     | 7.4  | 67   | 5     | 1   |
| 2   | 36    | 118     | 8.0  | 72   | 5     | 2   |
| 3   | 12    | 149     | 12.6 | 74   | 5     | 3   |
| 4   | 18    | 313     | 11.5 | 62   | 5     | 4   |
| 5   | 23    | 299     | 8.6  | 65   | 5     | 7   |
| 6   | 19    | 99      | 13.8 | 59   | 5     | 8   |
| 7   | 8     | 19      | 20.1 | 61   | 5     | 9   |
| 8   | 16    | 256     | 9.7  | 69   | 5     | 12  |
| 9   | 11    | 290     | 9.2  | 66   | 5     | 13  |
| 10  | 14    | 274     | 10.9 | 68   | 5     | 14  |
| 11  | 18    | 65      | 13.2 | 58   | 5     | 15  |
| 12  | 14    | 334     | 11.5 | 64   | 5     | 16  |
⋮
| 99  | 18    | 224     | 13.8 | 67   | 9     | 17  |
| 100 | 13    | 27      | 10.3 | 76   | 9     | 18  |
| 101 | 24    | 238     | 10.3 | 68   | 9     | 19  |
| 102 | 16    | 201     | 8.0  | 82   | 9     | 20  |
| 103 | 13    | 238     | 12.6 | 64   | 9     | 21  |
| 104 | 23    | 14      | 9.2  | 71   | 9     | 22  |
| 105 | 36    | 139     | 10.3 | 81   | 9     | 23  |
| 106 | 7     | 49      | 10.3 | 69   | 9     | 24  |
| 107 | 14    | 20      | 16.6 | 63   | 9     | 25  |
| 108 | 30    | 193     | 6.9  | 70   | 9     | 26  |
| 109 | 14    | 191     | 14.3 | 75   | 9     | 28  |
| 110 | 18    | 131     | 8.0  | 76   | 9     | 29  |
| 111 | 20    | 223     | 11.5 | 68   | 9     | 30  |


Here we get a subset of rows that have Ozone levels greater than 90.  The period before the greaer than symbol tells Julia to broadcast the comparison to each member of the Array of Ozone values.  Notice that the subset logic between the [] brackets is on the left side of the colon.  Filters left of the colon select rows, filters right of the colon select columns. 
julia> newair[newair[:Ozone] .> 90,:]
11x6 DataFrame
| Row | Ozone | Solar_R | Wind | Temp | Month | Day |
|-----|-------|---------|------|------|-------|-----|
| 1   | 115   | 223     | 5.7  | 79   | 5     | 30  |
| 2   | 135   | 269     | 4.1  | 84   | 7     | 1   |
| 3   | 97    | 267     | 6.3  | 92   | 7     | 8   |
| 4   | 97    | 272     | 5.7  | 92   | 7     | 9   |
| 5   | 108   | 223     | 8.0  | 85   | 7     | 25  |
| 6   | 122   | 255     | 4.0  | 89   | 8     | 7   |
| 7   | 110   | 207     | 8.0  | 90   | 8     | 9   |
| 8   | 168   | 238     | 3.4  | 81   | 8     | 25  |
| 9   | 118   | 225     | 2.3  | 94   | 8     | 29  |
| 10  | 96    | 167     | 6.9  | 91   | 9     | 1   |
| 11  | 91    | 189     | 4.6  | 93   | 9     | 4   |

d) Graphs
There are several packages available for plotting DataFrames.  Gadfly is similar to ggplot in R.



julia> using Gadfly
plot(x=newair[:Ozone],y=newair[:Day])



For a more detailed graph, pass some optional parameters.
julia> air_graph = plot(y=newair[:Ozone],x=newair[:Day],Geom.smooth,Guide.xlabel("Days"),Guide.ylabel("Ozone Level"))


To store the plot as an image:
julia> draw(PNG("air_graph.png", 6inch, 4inch), air_graph)

5) Julia is a general purpose programming language.


As impressive as Julia is for Data Science, perhaps the most exciting thing about learning it is the potential to use it often.  Packages for web technologies are highlighted well in the Julia Webstack. Hosting a RESTful webservice similar to running one in Ruby Sinatra takes just a few lines of code.

a) WebServices

In [1]:
Pkg.add("Morsel")
In [ ]:
using Morsel
app = Morsel.app()
route(app, GET | POST | PUT, "/") do req, res
    "This is the root"
end
get(app, "/about") do req, res
    "This app is running on Morsel"
end
start(app, 8004)
b) Sockets
It's similarly easy to run a chat server using http sockets.


julia> using HttpServer
WARNING: Dict-based `@docstring` config is deprecated. Use keywords instead.

julia> using WebSockets

julia> wsh = WebSocketHandler() do req,client
               while true
                   msg = read(client)
                   write(client, msg)
               end
             end
WebSocketHandler((anonymous function))

julia> server = Server(wsh)
Server(HttpHandler((anonymous function),TcpServer(init),["error"=>(anonymous function),"listen"=>(anonymous function)]),WebSocketHandler((anonymous function)))

julia> run(server,8080)
Listening on 8080...

c) Packages and Deployment
Julia's Package management system is a pleasure, taking many good ideas from predecessors like R.  You can get an idea of the growing number of packages available from the Package Ecosystem Pulse.



Deployment is somewhat uncharted, bu thanks to Docker, Deploying Julia can be surprisingly easy.  Still as the author notes, there is still work to be done creating deployment standards. Deployment best practices will evolve as Julia grows in popularity.



Getting started with Julia.

a) Code in the browser
 One of the coolest things you'll see when learning Julia is Juliabox.  You can mix markup and executable code in the same document. If you try Juliabox, you can upload the Introduction to Julia code used in  this article to your own instance.

b) Cloud 9 IDE
https://c9.io/
Cloud 9 is and excellent IDE.  It has some predefined Julia runners and Julia code formatting built in.  It's also free to create an account.

c) Install Julia
Julia installs easily on your local machine.  I run it on a Chromebook/Linux using Crouton .


2 comments:

Michael Blake said...

Interesting to see Julia is already ruffling feathers:
https://matloff.wordpress.com/2014/05/21/r-beats-python-r-beats-julia-anyone-else-wanna-challenge-r/

It's true that the old niche platforms like R weren't really designed for todays multi-core processors and cloud strategies. But Julia is designed to be a general purpose programming that happens to be very good at technical computing. It's much more than a statistical package, but is a general purpose programming language. R is more comparable to it's statistical peers such as Matlab, Scilab and Octave.

Still it's a good sign that it's seen as disruptive.

Hảo Nguyễn Duy said...

Thank you! It's very beautiful code.

Popular Articles