With each of the big three Hadoop vendors (Cloudera, Hortonworks, and MapR) providing their own Hadoop sandbox virtual machines (VMs), trying out Hadoop today has become extremely easy. For a developer, it is extremely useful to download one of these VMs, get started with Hadoop, and practice data science right away.
However, alongside core Apache Hadoop, these vendors package their own software into their distributions, mostly for orchestration and management, which can otherwise be a pain given the many scattered open-source projects within the Hadoop ecosystem. For example, Hortonworks includes the open-source Ambari, while Cloudera includes its own Cloudera Manager for orchestrating Hadoop installations and managing multi-node clusters.
What does the typical data science project life cycle look like?
This post looks at the practical aspects of implementing data science projects. It also assumes a certain level of maturity in big data (more on big data maturity models in the next post) and data science management within the organization. Therefore, the life cycle presented here differs, sometimes significantly, from purist definitions of ‘science’ that emphasize the hypothesis-testing approach. In practice, the typical data science project life cycle resembles more of an engineering view, imposed by constraints on resources (budget, data, and skills availability) and time-to-market considerations.
Some time back, I presented a webinar on BrightTalk. The slides for the talk have now been uploaded to Slideshare. The talk focused on how changes in digital technology are disrupting businesses, the effect of Big Data, the FOMO (fear of missing out) effect on big business, and what all of this means for the way we do business intelligence in the digital era.
The outsourcing model, which led to the “on-demand”, “as a service” model, has taken off with the increasing adoption of cloud computing and mobility. What started out with SaaS (software as a service) has now diversified into several other services.
Indeed, cloud computing has come to rest on three of these as its core pillars: infrastructure as a service (IaaS), platform as a service (PaaS), and software as a service (SaaS).
Probability concepts form the foundation for statistics.
A formal definition of probability: the probability of an outcome is the proportion of times the outcome would occur if we observed the random process an infinite number of times. This interpretation rests on the law of large numbers: as more observations are collected, the proportion of occurrences of a particular outcome converges to the probability of that outcome.
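To see the law of large numbers at work, here is a minimal simulation sketch in Python (an illustrative example, not from the original text): it flips a fair coin an increasing number of times and prints the observed proportion of heads, which settles toward the true probability of 0.5.

```python
import random

# Illustrative sketch (assumed example): estimate P(heads) for a fair coin by
# simulating ever-larger numbers of flips and watching the observed proportion
# converge toward the true probability of 0.5.
random.seed(42)

for n in (100, 1_000, 10_000, 100_000):
    heads = sum(random.random() < 0.5 for _ in range(n))
    print(f"{n:>7} flips: proportion of heads = {heads / n:.4f}")
```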
Disjoint (mutually exclusive) events are events that cannot both happen at the same time; that is, if A and B are disjoint, P(A and B) = 0. Complementary outcomes are mutually exclusive outcomes of the same random process whose probabilities add up to 1; if A and B are complementary, P(A) + P(B) = 1.
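As a quick illustration of these two identities (an assumed example, not from the original text), the snippet below rolls a fair six-sided die and takes A = "the roll is even" and B = "the roll is odd"; A and B are both disjoint and complementary, so the observed frequencies should give P(A and B) = 0 and P(A) + P(B) = 1.

```python
import random

# Illustrative sketch (assumed example): for a fair six-sided die, let
# A = "roll is even" and B = "roll is odd". A and B can never happen together
# (disjoint) and one of them always happens (complementary), so we expect
# P(A and B) = 0 and P(A) + P(B) = 1.
random.seed(0)
rolls = [random.randint(1, 6) for _ in range(100_000)]
n = len(rolls)

p_a = sum(r % 2 == 0 for r in rolls) / n                       # P(A): even roll
p_b = sum(r % 2 == 1 for r in rolls) / n                       # P(B): odd roll
p_a_and_b = sum(r % 2 == 0 and r % 2 == 1 for r in rolls) / n  # P(A and B)

print(f"P(A) = {p_a:.3f}, P(B) = {p_b:.3f}")
print(f"P(A and B) = {p_a_and_b:.3f}")   # 0.000: disjoint events
print(f"P(A) + P(B) = {p_a + p_b:.3f}")  # 1.000: complementary outcomes
```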