Maloy Manna

An introduction to Data Science

2015/05/20

Big Data / Data science / Machine Learning

big data / data science / introduction / machine learning

Reading time: 1 minute

I presented a talk last week introducing Data Science and associated topics to some enthusiasts.
Here’s a slide deck I put together quickly in Markdown using Swipe, a start-up building HTML5 presentation tools.
The contents include:

Data scientist skills
Data science: enablers and barriers
Big data analytics
Data science lifecycle
Use cases
Tools and technology
Project approach
Machine learning
Skills and roles
Learning resources

Here are the slides:

Why Spark is the big data platform of the future

2015/03/23

Big Data / Spark

Apache Spark / Hadoop / MLLib / RDD / scala

Reading time: 5 minutes

Apache Spark has created a lot of buzz recently. In fact, beyond the buzz, Apache Spark has seen phenomenal adoption and has been marked out as the successor to Hadoop MapReduce.

[Image: Apache Spark]

Google Trends confirms the hockey-stick-like growth in interest in Apache Spark. All leading Hadoop vendors, including Cloudera, now include Apache Spark in their Hadoop distributions.

[Image: Google Trends - Apache Spark]

So what exactly is Spark, and why has it generated such enthusiasm? Apache Spark is an open-source big data processing framework designed for speed and ease of use. Spark is well known for its in-memory performance, but that reputation has also given rise to misconceptions about its on-disk abilities. Spark is in fact a general execution engine that performs much better than older frameworks like MapReduce, both in memory and on disk. With its advanced DAG (directed acyclic graph) execution engine, Spark can run programs up to 100x faster than MapReduce in memory, or 10x faster on disk.

A gentle introduction to Machine Learning

2015/02/04

Big Data / Data science / Machine Learning

classification / machine learning / regression / supervised learning

Reading time: 4 minutes

Machine learning is a big part of big data and data science. It is a subset of artificial intelligence, a branch of science notorious for requiring advanced knowledge of mathematics. In practice, though, most data scientists don’t try to build a Chappie, and there are simpler, practical ways to get started with machine learning.

[Image: Gmail Priority Inbox]

Machine learning in practice involves making predictions based on data. Notable examples include Amazon’s product recommendations with the “customers also bought” scroll-list, Gmail’s priority inbox, and any email spam filter. How do these work? For Amazon, the user’s clicks are used to learn and predict user behavior and the propensity (likelihood) to buy certain items. The items the user is most likely to buy are then displayed by the recommendation system. Gmail’s system learns from the messages the user reads and/or replies to, and prioritizes similar messages.
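To make the "learn from what the user reads or replies to" idea concrete, here is a minimal sketch in Python. It is not how Gmail actually works: the per-sender scoring, the smoothing, the function names and the sample addresses are all assumptions made for illustration.

```python
from collections import defaultdict


def learn_engagement(history):
    """Estimate, per sender, how likely the user is to read or reply.

    history: iterable of (sender, engaged) pairs, where engaged is True
    if the user read or replied to that message.
    """
    counts = defaultdict(lambda: [0, 0])  # sender -> [engaged, total]
    for sender, engaged in history:
        counts[sender][0] += int(engaged)
        counts[sender][1] += 1
    # Smoothed rate (engaged + 1) / (total + 2), so a sender seen only
    # once does not get an extreme score of exactly 0 or 1.
    return {s: (e + 1) / (t + 2) for s, (e, t) in counts.items()}


def prioritize(inbox, scores, default=0.5):
    """Sort (sender, subject) messages, most likely to be read first."""
    return sorted(inbox, key=lambda msg: scores.get(msg[0], default), reverse=True)


# Hypothetical training data: past messages and whether the user engaged.
history = [
    ("boss@example.com", True),
    ("boss@example.com", True),
    ("boss@example.com", True),
    ("newsletter@example.com", False),
    ("newsletter@example.com", False),
    ("newsletter@example.com", True),
    ("friend@example.com", True),
]
scores = learn_engagement(history)

# Hypothetical new messages to rank; unknown senders get the default score.
inbox = [
    ("newsletter@example.com", "Weekly digest"),
    ("stranger@example.com", "Hello"),
    ("boss@example.com", "Budget review"),
]
ranked = prioritize(inbox, scores)
for sender, subject in ranked:
    print(sender, "-", subject)
```

The pattern is the same one described above: past behavior is turned into a score, and the score is used to predict what the user will care about next. Real systems use many more signals than the sender alone.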

Designing the future - Data Innovation Labs

2015/01/19

Big Data / Data science

Agile / data innovation lab / innovation / lab

Reading time: 4 minutes

With the ongoing big data revolution and the impending Internet of Things revolution, there has been renewed enthusiasm for “innovation” around data. Similar to the Labs concept started by Google (think Gmail Beta based on Ajax, circa 2004), more and more organizations, business communities, governments and countries are setting up Labs to foster innovation in data and analytics technologies. The idea behind these “data innovation labs” is to develop avant-garde data and analytics technologies and products in an agile fashion and move quickly from concept to production. Given the traditional bureaucratic setup in large organizations and governments, these Labs stand a better chance of fostering a culture of innovation because they are autonomous entities with a startup-style culture that leverages agile methodologies.

A Brief Introduction to Statistics – Part 3 - Statistical Inference

2015/01/04

Statistics

Reading time: 5 minutes

Statistical inference is concerned primarily with understanding the quality of parameter estimates.

[Image: Statistical Inference]

The sampling distribution represents the distribution of the point estimates based on samples of a fixed size from a certain population. It is useful to think of a particular point estimate as being drawn from such a distribution. Understanding the concept of a sampling distribution is central to understanding statistical inference.

A sample statistic is a point estimate for a population parameter, e.g. the sample mean is used to estimate the population mean. Note that point estimate and sample statistic are synonymous. Point estimates (such as the sample mean) vary from one sample to another; this variability is called sampling variability (sometimes also called sampling variation).
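A small simulation can make the sampling distribution tangible. The sketch below uses only Python's standard library; the population (10,000 values from an exponential distribution), the sample size and the number of samples are arbitrary assumptions chosen for illustration.

```python
import random
import statistics


def sample_means(population, sample_size, num_samples, seed=0):
    """Draw repeated samples of a fixed size and return the mean of each."""
    rng = random.Random(seed)
    return [
        statistics.mean(rng.sample(population, sample_size))
        for _ in range(num_samples)
    ]


# Hypothetical population: 10,000 values from a skewed distribution.
pop_rng = random.Random(42)
population = [pop_rng.expovariate(1 / 50) for _ in range(10_000)]
population_mean = statistics.mean(population)

# Each entry in `means` is one point estimate of the population mean.
sample_size = 100
means = sample_means(population, sample_size=sample_size, num_samples=1_000)

# Sampling variability: the point estimates differ from sample to sample.
print("first five sample means:", [round(m, 1) for m in means[:5]])

# The sampling distribution is centred close to the population mean.
print("population mean:        ", round(population_mean, 2))
print("mean of sample means:   ", round(statistics.mean(means), 2))

# Its spread is roughly the population standard deviation divided by
# the square root of the sample size.
spread = statistics.stdev(means)
expected_spread = statistics.pstdev(population) / sample_size ** 0.5
print("spread of sample means: ", round(spread, 2))
print("sigma / sqrt(n):        ", round(expected_spread, 2))
```

Collecting the sample means into a histogram would show the sampling distribution itself. Any single sample mean can be thought of as one draw from that distribution.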