
Thursday, July 23, 2015

How Data Science Saigon took the lead in a Kaggle Competition

Last month, at Tech Talk Tuesday, we formed a team for the Kaggle competition Getting Started with Julia. Last week, our team, Data Science Saigon, took the number one spot on the leaderboard. Here's how it happened.

Entering a Kaggle competition

You've got to be in it to win it. When our team met on June 16th, we created accounts on Kaggle's site and on Bitbucket by Atlassian, a company with offices here in Saigon. We reviewed DataFrames with Julia, and got the code from the Julia Tutorial and the K-Nearest-Neighbor tutorial working with help from the Kaggle forums. In particular, some method calls have changed since the tutorials were created, but we found the workaround in the "Convert method error when reading image files" topic. Our versions work as of Julia 0.3.7. Implementing the tutorial code put us in about 46th place at the time.
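For readers who haven't seen the tutorial, the core of k-nearest-neighbor classification fits in a few lines. Here is a minimal sketch in pure Python rather than Julia (the names and toy data are mine, not the tutorial's):

```python
from collections import Counter

def euclidean(a, b):
    # straight-line distance between two feature vectors
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def knn_predict(train_X, train_y, query, k=3):
    # label the query with the majority class among its k nearest neighbors
    neighbors = sorted(zip(train_X, train_y), key=lambda pair: euclidean(pair[0], query))
    return Counter(label for _, label in neighbors[:k]).most_common(1)[0][0]

# Toy data: two small clusters standing in for two character classes.
train_X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_y = ["a", "a", "a", "b", "b", "b"]
knn_predict(train_X, train_y, [0.2, 0.2])  # → "a"
```

The tutorial version does the same thing over raw pixel vectors of the character images.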

Machine Learning with Julia

At our next meeting we took a look at how to build a predictive model based on the Random Forest algorithm.

But tuning the parameters of our Random Forest model didn't drastically improve our score. This is when we found out about the Mocha package.
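The shape of the algorithm is easy to sketch: train many small trees on bootstrap resamples of the data and let them vote. This toy pure-Python version (stumps only, my own simplification, not the implementation we actually used) shows the idea:

```python
import random
from collections import Counter

def stump_fit(X, y):
    # fit a one-split "decision stump" on a randomly chosen feature
    f = random.randrange(len(X[0]))
    best = None
    for t in sorted({row[f] for row in X}):
        left = [label for row, label in zip(X, y) if row[f] <= t]
        right = [label for row, label in zip(X, y) if row[f] > t]
        if not left or not right:
            continue
        score = max(Counter(left).values()) + max(Counter(right).values())
        if best is None or score > best[0]:
            best = (score, f, t,
                    Counter(left).most_common(1)[0][0],
                    Counter(right).most_common(1)[0][0])
    if best is None:
        # degenerate bootstrap sample (a single class): always predict its majority
        majority = Counter(y).most_common(1)[0][0]
        return (f, X[0][f], majority, majority)
    return best[1:]

def forest_fit(X, y, n_trees=25):
    forest = []
    for _ in range(n_trees):
        idx = [random.randrange(len(X)) for _ in range(len(X))]  # bootstrap sample
        forest.append(stump_fit([X[i] for i in idx], [y[i] for i in idx]))
    return forest

def forest_predict(forest, row):
    # majority vote across all the trees
    votes = [left if row[f] <= t else right for f, t, left, right in forest]
    return Counter(votes).most_common(1)[0][0]

# Toy, linearly separable data: label is 1 when the single feature exceeds 5.
random.seed(7)
X = [[v] for v in [1, 2, 3, 4, 6, 7, 8, 9]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
forest = forest_fit(X, y)
forest_predict(forest, [1]), forest_predict(forest, [9])  # → (0, 1)
```

Real implementations grow full-depth trees and randomize the feature subset at every split; those are the parameters we were tuning.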

Convolutional Neural Networks 

The recent resurgence in popularity of Neural Networks is due to the amazing performance of Convolutional Neural Networks (CNNs) at image classification. This was exactly the solution our problem needed. Dung Thai is very knowledgeable about Deep Learning, and encouraged us to try out the Mocha package for Deep Learning in Julia. As a result we quickly moved into the top 20 on the leaderboard.
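The operation that gives CNNs their name is just a small kernel slid across the image. A minimal pure-Python sketch of the core step (strictly speaking cross-correlation, as deep-learning libraries implement it):

```python
def conv2d(image, kernel):
    # "valid" (no padding) 2-D convolution of a matrix with a small kernel
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + u][j + v] * kernel[u][v]
                 for u in range(kh) for v in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

# A 2x2 kernel of ones slid over a 3x3 image of ones:
conv2d([[1, 1, 1], [1, 1, 1], [1, 1, 1]], [[1, 1], [1, 1]])  # → [[4, 4], [4, 4]]
```

A CNN like the ones Mocha builds stacks many such filters and learns the kernel values from the training images.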

Pulling out all the Stops

At our next meeting Dung (pronounced Yung) summarized Learn from the Best, and we talked about how to get to the next level. Data Science Saigon has talent across a variety of platforms and languages, including C++, Caffe, Scilab, Python, and of course Julia. We also noticed a few things about the rules for this particular competition:

  1. Outside Data is not forbidden
  2. Semi-supervised learning is not forbidden
  3. The language does not have to be Julia  
We pondered how a convolutional network from a good Python library like Theano would perform. We also gathered many more training images from the Chars74k dataset and the Street View House Numbers dataset.

Saigon là số một (Saigon is number one).

Then last week Dung Thai, Vinh Vu, and Nguyen Quy checked in Python code using Theano that recognizes over 92% of the images correctly, and vaulted us into the #1 spot on the Getting Started with Julia leaderboard. Congratulations to everyone taking part in Data Science Saigon.

Our Remaining Challenge

So clearly, training with lots more data improved the score. But the question remains: would a CNN in Julia, given the additional training data, generate a similar score to the Python code? We hope to find out when we meet again. All of our code is here.

Monday, July 20, 2015

A Brief History of Neural Networks

In computer science and specifically machine learning, programmers have been trying to simulate the behavior of the brain since the late 1940s.  The fundamental pattern of the brain is modeled by programmers as loosely connected nodes capable of learning and modifying their behavior as information is processed. 

In 1948 Alan Turing's paper Intelligent Machinery called these loosely connected nodes unorganized machines, and compared them to an infant's brain.
Neural Network used in a BioWall

In the 1950s Frank Rosenblatt developed the Perceptron, a binary classification algorithm and one of the first implementations of a Neural Network. Programmers soon realized that neural networks were only truly effective with two or more layers, but the machine processing power of the time prevented implementing anything useful.
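Rosenblatt's learning rule is simple enough to fit in a few lines. A sketch in modern Python notation (not the 1950s formulation):

```python
def perceptron_fit(X, y, epochs=10, lr=0.1):
    # nudge the weights toward each misclassified example
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for row, target in zip(X, y):
            pred = 1 if sum(wi * xi for wi, xi in zip(w, row)) + b > 0 else 0
            err = target - pred  # -1, 0, or +1
            w = [wi + lr * err * xi for wi, xi in zip(w, row)]
            b += lr * err
    return w, b

def perceptron_predict(w, b, row):
    return 1 if sum(wi * xi for wi, xi in zip(w, row)) + b > 0 else 0

# It learns any linearly separable function, such as AND:
X, y = [[0, 0], [0, 1], [1, 0], [1, 1]], [0, 0, 0, 1]
w, b = perceptron_fit(X, y)
[perceptron_predict(w, b, row) for row in X]  # → [0, 0, 0, 1]
```

A single perceptron cannot learn a function like XOR, which is exactly the limitation that made multiple layers necessary.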

By the 1970s machines had improved and Neural Networks again gathered interest, but they were later surpassed in utility by simpler classification algorithms such as Support Vector Machines and linear classifiers.
Samples from two classes.  Samples on the margin between them are called support vectors.

This century, Neural Networks have made strides again with the invention of Deep Learning. Geoffrey E. Hinton of the University of Toronto improved classification results by training each layer of a neural network separately. As a result, many classification competitions are now won using Deep Neural Networks, often running on GPUs.

Scilab, Python's Theano, Julia's Mocha, and Caffe are all focused on deep learning and neural networks. Watch these projects evolve as deep learning gathers momentum.

Tuesday, June 9, 2015

Data Frames with Julia

Today's Tech Talk Tuesday is virtual; we'll do a live one next week.
Learn how to code with R-like DataFrames in Julia, and see Julia's amazing vectorized assignment operator work on a DataArray.

DataFrames with Julia from AppTrain on Vimeo.

We read a CSV file into a DataFrame, then learn how to subset it and update values in it.
Code is at
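For readers who know Python better than Julia, the same workflow with only the standard library looks roughly like this (the file contents here are made up, not the data from the video):

```python
import csv
import io

# A tiny stand-in for a CSV file on disk (hypothetical data).
raw = "name,score\nan,10\nbinh,7\nchau,9\n"

# Read the "file" into a list of row dicts, the stdlib's answer to a DataFrame.
rows = list(csv.DictReader(io.StringIO(raw)))

# Subset: keep rows where score is above 8.
high = [r for r in rows if int(r["score"]) > 8]

# Update: a vectorized-style assignment, bumping every score by one point.
for r in rows:
    r["score"] = int(r["score"]) + 1
```

Julia's DataFrames make the subset and the assignment one-liners over whole columns.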

Thursday, June 4, 2015

Tech Talk Tuesday: Reading and Writing Files with Julia

I'm planning a series of these short videos on Julia basics. 

Reading and Writing Files with Julia from AppTrain on Vimeo.

Actually this one is over 7 minutes. I'd like to get them down to under 5, but I'm still getting the hang of this. :) Anyway, thanks for coming to the talks, and keep coming; we'll build on the basics covered in the videos.

Monday, June 1, 2015

Tech Talk Tuesday: Starting with Julia

People asked about dialing into a Tech Talk.  Here's the next best thing, Tech Talk Screencasts.  A little unpolished, but here you go:

Starting with Julia from AppTrain Technology on Vimeo.

See how to type your first few lines of Julia code. You'll also learn some key concepts behind the Julia language: Optional Static Typing, multiple dispatch, vectorized operations, and find out what a homoiconic language is. Enjoy!

Sunday, May 24, 2015

Wednesday Morning Coffee Sponsor: Speedment

A special thanks goes to Speedment for sponsoring last week's Wednesday Morning Coffee at Breadtalk. Speedment is an open source Object Relational Mapper (ORM). An ORM makes it easy for object-oriented programmers and programs to persist objects to a traditional SQL database. Speedment is an accelerated ORM: aside from simplifying the database mapping, Speedment speeds up query response times. It's also a graph database, so it specializes in large datasets and excels with data where relationships between nodes are queried.

This little hare is Spire, the Speedment mascot. I've installed Speedment several times, and it's very effective for projects where there will be lots of queries on your data, and where relationships are important. It's an honor to have a company that understands the importance of relationships sponsor our coffee.

Thursday, May 14, 2015

Python vs Julia

I really enjoyed this Python or Julia comparison from these quantitative economists. They give insightful advantages and disadvantages for both languages. I dug up the site because I started to wonder if I'm crazy to be learning Julia at the same time that I'm working with (and still learning) Python. The final statement on their page eased my mind:

Still Can’t Decide?

Learn both — you won’t regret it

Friday, May 8, 2015

Data Science with Python

At the last Tech Talk Tuesday we took an overview of Python's  Data Science related packages.

The key packages for numerical computing are NumPy, SciPy, and scikit-learn. The documentation for Python is great, and makes presentations like this easy. These packages are loaded with code samples, even for complex concepts like grid search and cross validation. The machine learning package, scikit-learn, also has exercises below the code samples. Doing the exercises reinforces the concepts, and is great preparation for solving problems like the ones in Kaggle competitions.
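Cross validation, the concept behind those exercises, is easy to state in code: hold out a different fold of the data on each round. A stdlib sketch of the index bookkeeping (scikit-learn's KFold does all of this for you):

```python
def k_fold_indices(n, k):
    # yield (train, test) index lists for k rounds of cross validation
    fold = n // k
    for i in range(k):
        # the last fold absorbs any leftover samples
        test = list(range(i * fold, (i + 1) * fold)) if i < k - 1 else list(range(i * fold, n))
        train = [j for j in range(n) if j not in test]
        yield train, test

for train, test in k_fold_indices(6, 3):
    print(test)  # [0, 1] then [2, 3] then [4, 5]
```

Each round trains a model on the train indices and scores it on the held-out test indices; averaging the scores estimates how the model will do on unseen data.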

We also demoed IPython Notebooks, a fantastic way to create live data analysis documents.

Sunday, May 3, 2015

Putting the Train in AppTrain

In late 2005, when first learning Ruby and Rails, I founded the AppTrain project,  a web interface to the early Rails generators.   The Train represented a vehicle that was rolling along on top of Rails. As Rails grew in popularity, we began helping build Rails teams and the train took on a new meaning. We were training developers in Rails and related technologies.

And now, years later, still excited about the future of technology, we're doing plenty of data science programming and machine learning. With machine learning, specifically supervised learning, it's important to build a good training set of data. Training sets represent a relationship between a result and a list of data points that correspond with that result. The result is also referred to as a signal.

In supervised learning different algorithms can be trained on these training sets.  When an algorithm is being trained, it is looking for a function that best explains a signal.

Imagine a small data set like this:

2 1
6 3
10 5

The first number on each line is our result, or signal.  The second is the input data that leads to that signal.  Do you see a function that could predict the value of the signal x given a new value for y?


Visually, we immediately see that x is always greater than y (x > y). Our minds search for a function that will predict x. How much greater is x than y? You'll notice pretty quickly that it's exactly double.

x = 2y

So for a new y value of 4, the function predicts x = 2 × 4 = 8. x represents the signal; y is the training data.

Data Scientists have developed many algorithms that can run through numbers similar to the way our minds do.  But they can do it much faster.  Imagine a training set not with three rows, but with 100 or 1000.  It would be pretty boring to read through them to make sure x was always double y, but it's a great job for a computer.
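For this toy problem, "training" amounts to estimating the slope, which a computer can do over any number of rows in an instant. For example, a least-squares fit for a zero-intercept line:

```python
def fit_slope(ys, xs):
    # least-squares estimate of a in the model x = a * y
    return sum(x * y for x, y in zip(xs, ys)) / sum(y * y for y in ys)

ys = list(range(1, 1001))   # 1000 rows of training input
xs = [2 * y for y in ys]    # the signal is always double the input
fit_slope(ys, xs)           # → 2.0
```

A thousand rows or a million, the computer recovers the "exactly double" rule the same way our eyes did on three rows.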

Complicated Data Sets

Now imagine the training set has not just 2 variables (columns), x and y, but 10, or 100.

Here's data from a training set in the Restaurant Revenue prediction competition at Kaggle.  

In this case the result (or signal) we're trying to predict is the last column in each row, the revenue.

The Python programming language is a favorite of data science programmers. Scikit-learn is the machine learning library for Python. It contains learning algorithms designed to be trained on data sets like this restaurant revenue data. Each algorithm in scikit-learn looks for functions that predict the signals found in training data.

To solve Kaggle problems like the restaurant revenue problem, competitors typically first try one of the single models found in scikit-learn. On the discussion board for the competition, people mention using support vector machines (SVM), Gradient Boosting (GBM), and Random Forest. But competition winners ultimately blend techniques, or even devise their own algorithms.
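At its simplest, "blending" just means combining several models' predictions, for example as a weighted average (the weights and revenue numbers below are made up for illustration):

```python
def blend(predictions, weights):
    # weighted average of several models' predictions for one restaurant
    total = sum(weights)
    return sum(p * w for p, w in zip(predictions, weights)) / total

# Hypothetical revenue predictions from an SVM, a GBM, and a Random Forest:
blend([4.2e6, 3.9e6, 4.4e6], weights=[0.2, 0.5, 0.3])
```

Because different models make different mistakes, the blend is often more accurate than any single model in it.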

Meanwhile, the train cruises forward at AppTrain.  Today we're building training sets, and training algorithms. Want to learn more about machine learning?  Attend our tech talks at Làm việc mạng in Saigon this summer.

Thursday, March 19, 2015

Introduction to Julia

"Julia is a fresh approach to technical computing."  boasts the startup message, flourished with colorful circles hovering above a bubbly ASCII Julia logo.  The formatting effort is not wasted, it's an exuberant promise: Julia will make the command line fun again.
apptrain_1@julia:~/workspace $ julia
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation:
   _ _   _| |_  __ _   |  Type "help()" to list help topics
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.2.1 (2014-02-11 06:30 UTC)
 _/ |\__'_|_|_|\__'_|  |  
|__/                   |  x86_64-linux-gnu


Julia was created by four Data Scientists from MIT who began working on it around 2011. The language is beginning to mature at a time when the Data Scientist job title is popping up on resumes as fast as Data Scientist jobs appear. The timing is excellent. R programming, an offshoot of S, is the language of choice for today's mathematical programmer. But it feels clunky, like a car from the last century. While Julia may not unseat R in the world of Data Analysis, plans don't stop there.

If you want to code along with the examples in this article, jump to Getting Started with Julia and choose one of the three options to start coding.

Julia is a general purpose programming language. Its creators have noble goals: they want a language that is fast like C, flexible with cool metaprogramming capabilities like Ruby, capable of parallel and distributed computing like Scala, and able to express true mathematical equations like MATLAB.

Why program in Julia?

1) Julia is Fast

Julia already boasts faster matrix multiplication and sorting than Go and Java. It uses the LLVM compiler, which languages like Go use for fast compilation. Julia uses just-in-time (JIT) compilation to machine code, and often achieves C-like performance numbers.

2) Julia is written in Julia

Contributors need only work with a single language, which makes it easier for Julia users to become core contributors. 
"As a policy, we try to never resort to implementing things in C. This keeps us honest – we have to make Julia fast enough to allow us to do that" -Stephan Karpinski

And, as the language's co-creator Karpinski notes in the comments of the referenced post, writing the language itself in Julia means that when improvements are made to the compiler, both system and user code get faster.

3) Julia is Powerful

Like most programming languages, its implementation is Open Source. Anyone can work on the language or the documentation. And like most modern programming languages, Julia has extensive metaprogramming support. Its creators credit the Lisp language as their inspiration:
Like Lisp, Julia represents its own code as a data structure of the language itself.
a) Optional Strong Typing
Using strong typing can speed up compiling, but Julia keeps strong typing optional, which frees up programmers who want to write dynamic routines that work on multiple types. 
julia> @code_typed(sort(arry))
1-element Array{Any,1}:
 :($(Expr(:lambda, {:v}, {{symbol("#s1939"),symbol("#s1924")},{{:v,Array{Float64,1},0},{symbol("#s1939"),Array{Any,1},18},{symbol("#s1924"),Array{Any,1},18}},{}}, :(begin $(Expr(:line, 358, symbol("sort.jl"), symbol("")))
        #s1939 = (top(ccall))(:jl_alloc_array_1d,$(Expr(:call1, :(top(apply_type)), :Array, Any, 1))::Type{Array{Any,1}},$(Expr(:call1, :(top(tuple)), :Any, :Int))::(Type{Any},Type{Int64}),Array{Any,1},0,0,0)::Array{Any,1}
        #s1924 = #s1939::Array{Any,1}
        return __sort#77__(#s1924::Array{Any,1},v::Array{Float64,1})::Array{Float64,1}

b) Introspective

Julia's introspection is awesome, particularly if you enjoy looking at native assembly code. Dissecting assembly comes in handy when optimizing algorithms, and Julia programmers have several introspection functions for doing it. Here the code_native method shows the recursive nature of a binary sort algorithm.
julia> code_native(sort,(Array{Int,1},))
        push    RBP
        mov     RBP, RSP
        push    R14
        push    RBX
        sub     RSP, 48
        mov     QWORD PTR [RBP - 56], 6
        movabs  R14, 139889005508848
        mov     RAX, QWORD PTR [R14]
        mov     QWORD PTR [RBP - 48], RAX
        lea     RAX, QWORD PTR [RBP - 56]
        mov     QWORD PTR [R14], RAX
        xorps   XMM0, XMM0
        movups  XMMWORD PTR [RBP - 40], XMM0
        mov     QWORD PTR [RBP - 24], 0
        mov     RBX, QWORD PTR [RSI]
        movabs  RAX, 139888990457040
        mov     QWORD PTR [RBP - 32], 28524096
        mov     EDI, 28524096
        xor     ESI, ESI
        call    RAX
        lea     RSI, QWORD PTR [RBP - 32]
        movabs  RCX, 139889006084144
        mov     QWORD PTR [RBP - 40], RAX
        mov     QWORD PTR [RBP - 32], RAX
        mov     QWORD PTR [RBP - 24], RBX
        mov     EDI, 128390064
        mov     EDX, 2
        call    RCX
        mov     RCX, QWORD PTR [RBP - 48]
        mov     QWORD PTR [R14], RCX
        add     RSP, 48
        pop     RBX
        pop     R14
        pop     RBP

c) Multiple Dispatch

Multiple dispatch allows object-oriented behavior. Each function can have several methods, each designed to operate on particular types of parameters. The appropriate method is dispatched at runtime based on the argument types.

julia> methods(sort)
# 4 methods for generic function "sort":
sort(r::UnitRange{T<:Real}) at range.jl:533
sort(r::Range{T<:Real}) at range.jl:536
sort(v::AbstractArray{T,1}) at sort.jl:358
sort(v::AbstractArray,dim::Integer) at sort.jl:368
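Languages without built-in multiple dispatch can emulate it. A toy Python registry keyed on argument types shows the mechanic (an illustration only, not how Julia implements it):

```python
_methods = {}

def method(*types):
    # register one implementation of a generic function per type signature
    def register(fn):
        _methods[(fn.__name__,) + types] = fn
        return fn
    return register

def dispatch(name, *args):
    # pick the implementation matching the runtime types of the arguments
    return _methods[(name,) + tuple(type(a) for a in args)](*args)

@method(int, int)
def area(w, h):          # rectangle version
    return w * h

@method(float)
def area(r):             # circle version
    return 3.14159 * r * r

dispatch("area", 3, 4)   # calls the (int, int) method → 12
dispatch("area", 2.0)    # calls the (float,) method
```

In Julia the lookup table and the dispatch call are invisible: you just write two `area` methods and call `area(...)`.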

Thursday, February 12, 2015


Julia is a high-level, high-performance dynamic language for scientific computing. It has been gaining traction as a faster alternative to Matlab, R and NumPy and as a more productive alternative to C, C++ and Fortran. Julia is particularly relevant when both expressiveness and performance are paramount – in areas like machine learning, “big statistics”, linear algebra, bioinformatics, and image analysis.

Monday, February 9, 2015

NFL Fantasy Sports API

The NFL has made an impressive Application Programming Interface (API) available to application developers.

Fantasy Football Web Services allows existing or new applications to access live NFL data, potentially in real time. It also allows users to join existing fantasy leagues, or even create new leagues. Perhaps the most exciting offerings are still to come. Michael Vizard of Programmable Web writes:

A big part of that effort revolves around giving fans access to statistics and analytics tools that they can use to figure out which players to draft and keep. In the postseason, [the NFL] even went so far as to create a separate fantasy game event that involved just the teams that made the playoffs.

Access to these statistics and analytic tools is something large organizations like the NFL need to compete with the powerful data analysis capabilities available to smaller companies today. In addition to the playoff fantasy game mentioned above, there's also a Pro Bowl API available. So the NFL seems to be running with the API.

First, let's have a closer look at the Web services available now, which are documented here: . Then we'll explore some hidden gems in the API that access some underlying predictive analytics, straight from the NFL!

Available Data

Any current player statistics are available through the well-documented API calls. To write to the API, you'll need a key, which you request by emailing .

Scoring Leaders


The data comes back in XML by default, but you can clean that up with a simple format parameter at the end of your request:

Weekly Stats

Users can further filter requests by position, team, week and season.  The available  stats go back to 2009.

Advanced Stats

Additional statistics, such as RedZone Touches, are also available:

{"QB":[{"id":"2533033","esbid":"GRI283140","gsisPlayerId":"00-0029665","firstName":"Robert","lastName":"Griffin","teamAbbr":"WAS","opponentTeamAbbr":"@HOU","position":"QB","stats":{"FanPtsAgainstOpponentPts":"25.00","FanPtsAgainstOpponentRank":"2","Carries":"9","Touches":"9","Receptions":false,"Targets":false,"ReceptionPercentage":false,"RedzoneTargets":false,"RedzoneTouches":"1","RedzoneG2g":false},"status":"Loss, 6-17"}],"RB":[{"id":"2533457","esbid":"MOR317547","gsisPlayerId":"00-0029141","firstName":"Alfred","lastName":"Morris","teamAbbr":"WAS","opponentTeamAbbr":"@HOU","position":"RB","stats":{"FanPtsAgainstOpponentPts":"25.60","FanPtsAgainstOpponentRank":"4","Carries":"28","Touches":"28","Receptions":false,"Targets":false,"ReceptionPercentage":false,"RedzoneTargets":false,"RedzoneTouches":"4","RedzoneG2g":"2"},"status":"Loss, 6-17"}],"WR":[{"id":"80425","esbid":"HAR829482","gsisPlayerId":"00-0026998","firstName":"Percy","lastName":"Harvin","teamAbbr":"NYJ","opponentTeamAbbr":"OAK","position":"WR","stats":{"FanPtsAgainstOpponentPts":"39.00","FanPtsAgainstOpponentRank":"3","Carries":"5","Touches":"11","Receptions":"6","Targets":"8","ReceptionPercentage":"75","RedzoneTargets":"1","RedzoneTouches":"2","RedzoneG2g":"1"},"status":"Win, 
19-14"}],"TE":[{"id":"2530473","esbid":"ADA482150","gsisPlayerId":"00-0028337","firstName":"Kyle","lastName":"Adams","teamAbbr":"","opponentTeamAbbr":"Bye","position":"TE","stats":{"FanPtsAgainstOpponentPts":"","FanPtsAgainstOpponentRank":"","Carries":false,"Touches":"1","Receptions":"1","Targets":"1","ReceptionPercentage":"100","RedzoneTargets":false,"RedzoneTouches":false,"RedzoneG2g":false},"status":""}],"K":[{"id":"2499370","esbid":"AKE551610","gsisPlayerId":"00-0000108","firstName":"David","lastName":"Akers","teamAbbr":"","opponentTeamAbbr":"Bye","position":"K","stats":{"FanPtsAgainstOpponentPts":"","FanPtsAgainstOpponentRank":"","Carries":false,"Touches":false,"Receptions":false,"Targets":false,"ReceptionPercentage":false,"RedzoneTargets":false,"RedzoneTouches":false,"RedzoneG2g":false},"status":""}],"DEF":[{"id":"100029","esbid":false,"gsisPlayerId":false,"firstName":"San Francisco","lastName":"49ers","teamAbbr":"SF","opponentTeamAbbr":"@DAL","position":"DEF","stats":{"FanPtsAgainstOpponentPts":"5.00","FanPtsAgainstOpponentRank":"20","Carries":false,"Touches":false,"Receptions":false,"Targets":false,"ReceptionPercentage":false,"RedzoneTargets":false,"RedzoneTouches":false,"RedzoneG2g":false},"status":"Win, 28-17"}]}

Managing Leagues with API Writes

Add a Player

This call adds a player to your fantasy team.

A valid API key will get a success response:


Create a League

You can even create a new league,

then email out links for people to join:

For the application developer and fantasy sports fan, the fun has just begun.

Analytic Tools

Developers and analysts are used to writing their own analytic tools. The API provides data designed just for custom analytics. The Pro Bowl API returns players' Twitter user IDs. Those feeds, along with the players/news call, can keep users up to date with the latest developments. Potentially, that text data can even be mined and analyzed for predictors of next week's performance. Does Gronk play better after appearing on Conan?


Predictive analytics has become so prevalent that the NFL is now providing projections of next week's fantasy points.


Have a look at the JSON response to this request for 2014 week stats.

Algorithms behind the scenes at the NFL boldly predicted at week 1 that Kansas City defensive back Husain Abdullah would have no points the next week (weekProjectedPts):

{"id":"729","esbid":"ABD660476","gsisPlayerId":"00-0025940","name":"Husain Abdullah","position":"DB","teamAbbr":"KC","stats":{"1":"16","70":"58","71":"13","73":"1","76":"1","81":"10","82":"39","84":"2","85":"5","89":"1"},"seasonPts":82.5,"seasonProjectedPts":0,"weekPts":3,"weekProjectedPts":0}

The NFL algorithms were close. If you change the week number to 2 in the URL above, you'll see that Husain did scrape up a point the next week.


Season projected points appear in the /players/stats response as well. The field appears to be a placeholder for now: it doesn't change week to week, and it's 0 for any year prior to 2014. It will be interesting to watch this attribute in 2015.

Watch for more updates to the NFL Fantasy Sports API during the offseason, and for some interesting applications built around the new API.

Friday, November 28, 2014

Exploring Datomic by Øredev Conference

Datomic is a new database with an intriguing distributed architecture. It separates reads, writes and storage, allowing them to scale independently. Queries run inside your application code using a Datalog-based language. Spreading queries across processes isolates them from one another, enabling real-time data analysis without copying to a separate store, opening full query functionality to clients of your system, and more. This talk explores Datomic's architecture and some of its implications, focused entirely on technical details.

Monday, August 11, 2014

Running Speedment ACE on Amazon EC2

Speedment ACE is a Graph Database Converter and a powerful software development tool. A graph database is a NoSQL database that performs best when the relationships between nodes of your data are the most important (and most frequently accessed) part of that data.

Speedment ACE builds a Graph Data Grid (GDG) automatically, either from existing tables or using the ACE front end to generate those tables. A great way to check out Speedment ACE is to install it on Amazon Web Services. Here's how to get going with Speedment in 10 easy steps.

1)  Create an AWS MySql Instance.  

Note the database endpoint, name, username and password, then connect from a MySQL client.

2)  From the Speedment Programmer's Guide

Create a test schema and sample users table:

CREATE SCHEMA speedment_test; 
USE speedment_test; 
CREATE TABLE `speedment_test`.`user` ( 
 `id` INT NOT NULL AUTO_INCREMENT,
 `name` VARCHAR(45) NOT NULL,
 `surname` VARCHAR(45) NOT NULL,
 `email` VARCHAR(45) NOT NULL,
 `password` VARCHAR(45) NOT NULL,
 PRIMARY KEY (`id`), UNIQUE INDEX `Index_email`(`email`),
 INDEX `Index_name`(`name`)
) ENGINE = InnoDB;

3) Create an AWS Windows Instance

Be sure to associate the instance with a key pair so you can connect using a Windows RDP client. Then, in the AWS Console, launch your new instance and generate a password using your private key. Also make note of the IP address of your new instance.

4) Connect to your new Windows Instance using an RDP client.

If you have Google Chrome, you can use the 2X Client for RDP/Remote Desktop.

5) Internet Explorer on your new EC2 Windows Instance will ask you

for permission to add every URL you visit to its trusted sites. You may want to install another browser to avoid this.

6) Download and install the JDK 1.7

After installing, set the JAVA_HOME environment variable to the JDK install directory (the directory containing the bin folder where the 'java' executable is located).

7) Download and unzip the Speedment ACE Front End

Run it from the ace.bat file located in the bin directory.

8) After registering, Create a new Project.

9) Right-click on the new project and select “Add DBMS”.  

Add the info from the DB instance we created in step 1.

10) Starting on page 28 of the  Speedment Programmer's Guide

you can now explore the capabilities of the Speedment ACE front end. These include the ability to generate code for rapid application development and to build a Graph Data Grid (GDG), the core of Speedment's optimization capabilities.

Saturday, August 9, 2014

0802 - Intro to Graph Databases by Neo Technology

Join this webinar for a high-level introduction to graph databases. This webinar demonstrates how graph databases fit within the NOSQL space, and where they are most appropriately used. In this session you will learn:

  1. An overview of NOSQL
  2. Why graphs matter
  3. An overview of Neo4j
  4. Use cases for graph databases

Tuesday, January 7, 2014

Big Data Analytics by ICGX

The data revolution has only just begun. Everyone is talking about Big Data:

  "Big Data grows up" - Forbes
  "Business opportunities in Big Data" - INC.
  "Big Data powers evolution of decision making" - WSJ
  "How Big Data got so big" - NYT
  "Big Data is hot? Now what?" - Forbes
  Businesses "freak out" over Big Data - Information Week
  "2012: The year of Big Data" - WSJ
  "The age of Big Data" - NYT

But it's not just hype. The world's data is doubling every 1.2 years. There are 7 billion people in the world, and 5.1 billion of them own cell phones. Each day, we send over 11 billion texts, watch over 2.8 billion YouTube videos, and perform almost 5 billion Google searches. And we're not just consuming it. We're creating it. We are data agents. We generate over 2.4 quintillion bytes every day from consumer transactions, communication devices, online behavior, and streaming services. In 2012, the world's information totaled over 2 zettabytes. That's 2 trillion gigabytes. By 2020, that number will be 35 trillion. We will need 10x more servers, 50x more data management, and 75x more files to handle it all.

If you're like most companies, you aren't ready. 80% of this new data is unstructured. It is too large, too complex, and too disorganized to be analyzed by traditional tools. There are 500K computer scientists yet only 30K mathematicians. We will fall short of the talent needed to understand Big Data by at least 100K. To find opportunities in Big Data, we need new tools and new talent to mine this information and find value. We need Big Data Analytics.

Big Data Analytics is more than technology. It's a new way of thinking. It will help companies better understand customers, find hidden opportunities, and even help our government better serve citizens and mitigate fraud. It will inspire hundreds, thousands, even millions of new startups. It will alter the landscape across virtually every industry, and finally answer the questions looming over every CEO's head: "How can my business use Big Data?", "What problems can it solve?", "Who should be leading the charge: CIO, CMO, or Chief Data Scientist?". In every revolution, there are opportunities that will be seized only by those armed with the right tools and the right strategy. We are at the beginning of the Big Data Revolution.

Thursday, January 2, 2014

Running OpenTSDB on Amazon EC2

Although there are cheaper alternatives for production systems, it's easy enough to get the Open Time Series Database (OpenTSDB) running on an EC2 instance of Amazon Web Services.

  1. First you'll need to run HBase on EC2
  2. Make a data directory mkdir hbase_data
  3. vi hbase-0.94.13/conf/hbase-site.xml
  4. Using vi update the hbase.rootdir property value to: file:///home/ec2-user/hbase-0.94.13/hbase-\${}/hbase
  5. sudo yum install git
  6. git clone git://
  7. sudo yum install automake
  8. yum install gnuplot
  9. cd opentsdb
  10. ./
  11. env COMPRESSION=NONE HBASE_HOME=path/to/hbase-0.94.X ./src/
  12. tsdtmp=${TMPDIR-'/tmp'}/tsd
  13. mkdir -p "$tsdtmp" 
  14. ./build/tsdb tsd --port=4242 --staticroot=build/staticroot --cachedir="$tsdtmp"
  15. In AWS, click on your EC2 instance, then click "Security Groups" at the bottom left.  Click on the default group, then click the "inbound" tab.  You can now open the ec2 port 4242. 
Your ip address on port 4242 will display the web UI for your instance of OpenTSDB:

Thursday, December 26, 2013

Running HBase on Amazon EC2

  1. Create an Amazon Linux EC2 instance.
  2. Log into your EC2 instance using ssh.
  3. sudo yum install java-1.6.0-openjdk
  4. wget
  5. tar xfz hbase-*
  6. vi .bashrc
  7. Add this line at the bottom of the file: JAVA_HOME=/usr/java/default
  8. sudo vi /etc/hosts
  9. Comment out the localhost line: #   localhost localhost.localdomain
  10. cd hbase-*
  11. Start HBase: ./bin/
  12. Check the log files: cat logs/hbase-*

Tuesday, December 24, 2013

The Journal of Trading: Smart Technology for Big Data

Smart Technology for Big Data was published in the Winter edition of the Journal of Trading. You need to register to read it. Here's the abstract:

This article provides an underlying structure for managing the big data phenomenon. Innovations and tools fundamental to handling big data are highlighted, and we look at how these technologies are being implemented in the financial industry.

See more at:

Tuesday, December 3, 2013

The Year of the Yottabyte?

Big Data has been the big technology buzzword for a couple of years now. So recently, as a nod to big data, the term yottabyte has become a top technology buzzword. In my upcoming paper "Smart Technology for Big Data" (for Institutional Investor Journals), this chart introduces big data.

Exhibit 1

Term                   | Common usage                       | Example
gigabyte (billion)     | computer RAM                       | New laptops have about 8 GB of RAM.
terabyte (trillion)    | computer hard drive sizes          | The NYSE produces about 1 TB of information per day.
petabyte (quadrillion) | total company storage space        | Facebook's largest Hadoop cluster contains 100 PB of disk space.
exabyte (quintillion)  | all the ... in the world           | Global internet traffic is 21 EB per month.
zettabyte (sextillion) | future storage discussions         | The total size of the internet is about 1 ZB.
yottabyte (septillion) | speculation                        | Nearly infinite.

It's a favorite exhibit of those who have read the paper. Many had never heard of the terms larger than a petabyte. In the article, I mention that the term yottabyte is used in "speculation". But is that speculation about to enter the realm of reality? The recent article about Twitter adding security to impede surveillance mentions that the National Security Agency's datacenter in Bluffdale, Utah is "possibly capable" of storing a yottabyte. We're still in speculation mode, but for how long? Will 2014 be the year of the Yottabyte?
