What is Statistics? Collected observations are called data. Statistics is the study of how best to collect, analyze, and draw conclusions from data. Each observation in the data is called a case, and the characteristics of a case are called variables. In a matrix/table analogy, a case is a row while a variable is a column.
Statistics - Correlation (Courtesy: xkcd.com)
Types of variables:
Numerical - Can take a wide range of numerical values, and can be discrete or continuous.
Categorical - Takes values from a fixed set of categories (levels).
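The row/column analogy above can be sketched in a few lines of Python. This is a minimal illustration, not tied to any particular dataset: the variable names (`height_cm`, `siblings`) and values are made up for the example. Each dict is one case (row), and the keys shared across cases are the variables (columns); `height_cm` is a continuous numerical variable, `siblings` a discrete one.

```python
# Each dict is a case (a row); each key is a variable (a column).
# Variable names and values are illustrative, not from a real dataset.
cases = [
    {"height_cm": 170.2, "siblings": 1},  # continuous, discrete
    {"height_cm": 158.7, "siblings": 3},
    {"height_cm": 181.0, "siblings": 0},
]

variables = list(cases[0].keys())  # the columns of the table
print("variables:", variables)
print("number of cases:", len(cases))
```

Running this prints the two variable names and the case count (3), mirroring a 3x2 data table.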
This is the fourth part of a series of posts on big data. Read the previous posts here: Part-1, Part-2 and Part-3.
With the ongoing data explosion, and the improvement in technologies able to deal with it, businesses are looking to leverage this big data to mine insights, gain competitive advantage, reinvent business models and create new markets.
A huge share of these big data volumes comes from system logs, user-generated content on social media like Twitter or Facebook, sensor data and the like. All of these data types are what we call “unstructured”. Businesses that do not leverage the vast amounts of unstructured data available to them risk losing out on the valuable insights such data can yield.
This is the 3rd part of a series of posts on Big Data. Read Part-1 (What is Big Data) and Part-2 (Hadoop).
Traditionally, data warehouses have been built with relational databases as their backbone. With the new challenges (the 3Vs) of Big Data, relational databases have been falling short of the requirements of handling:
New data types (unstructured data)
Extended analytic processing
Throughput (TB/hour loading) with immediate query access
The industry has turned to Hadoop as a disruptive solution for these very challenges.
As discussed in Part 1 of this series, Hadoop is the foremost among the tools currently used for deriving value out of Big Data. The process of gaining insights from data through Business Intelligence and analytics essentially remains the same. However, with the huge variety, volume and velocity (the 3Vs of Big Data), it has become necessary to rethink the data management infrastructure. Hadoop, originally designed to be used with the MapReduce algorithm to solve the parallel-processing constraints of distributed architectures (e.g. web indexing) at web giants like Yahoo or Google, has become the de facto standard for Big Data (large-scale, data-intensive) analytics platforms.
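To make the MapReduce idea mentioned above concrete, here is a minimal single-machine sketch of the classic word-count example in plain Python. The documents are made-up sample input; a real Hadoop job would distribute the map and reduce phases across a cluster rather than run them in one process.

```python
from collections import defaultdict

# Hypothetical input documents; in Hadoop these would be sharded across nodes.
docs = ["big data big insights", "data drives insights"]

# Map phase: each document independently emits (word, 1) pairs.
mapped = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle phase: group all emitted pairs by their key (the word).
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: each key's values are aggregated independently.
counts = {word: sum(values) for word, values in groups.items()}
print(counts)  # {'big': 2, 'data': 2, 'insights': 2, 'drives': 1}
```

The point of the split into map, shuffle and reduce is that the map and reduce steps have no shared state, so they parallelize naturally across many machines, which is exactly the constraint web-indexing workloads imposed.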
You can’t miss all the buzz about Big Data! Over the past few years, the buzz around the cloud and Big Data shaping the future of computing, and of IT and analytics in particular, has grown incessantly strong. As with most buzzwords that get hijacked by marketing to suit their own products’ storylines, and that confuse business users and IT staff alike, Big Data means several things to several people.