Over the last few years, there has been a paradigm shift in how large data sets are stored and processed. The traditional approach was to extract data from distributed systems, store it in a centralized data warehouse, and develop analytics on top of that warehouse. Central data warehouses were relational database management systems used mainly to process structured data such as financial transactions, shipping data, and employee information. However, technological advances on several fronts have made it possible to generate large volumes of data, both structured and unstructured, at high frequency through a variety of data sources. This led to a tipping point at which the traditional architecture was no longer scalable or efficient enough to store, process, and retrieve such large data sets.
Big Data Definition (How It Is Different from Traditional Data)
Big Data is regarded as the next big thing in computing after the Internet revolution. It is the term used to describe the exponential growth, availability, and usage of structured and unstructured data. Big Data is becoming a global phenomenon across government, business, and society because the availability of high volumes of data and of efficient, affordable processing technologies enables more accurate, fact-based analyses. The more accurate the analyses, the better the understanding of business problems, opportunities, and threats, leading to better decision making and operational excellence in any business.
The relevance and emergence of Big Data have increased because of the 3Vs of data: Volume, Velocity, and Variety. Volume refers to the large amount of data being generated through a variety of sources. This data can be structured, such as financial transactions, invoices, and personal details, or unstructured, such as Twitter/Facebook messages, reviews, videos, and images. Velocity indicates the speed at which new data is generated; social media messages and sensor data are examples of high-volume data generated at high velocity. Variety refers to the different types of data available for storage, retrieval, and analysis. Traditionally, data was stored in relational databases in a very structured way (rows, columns, values), whereas today such structured data represents only about 20% of the data being generated.
Difference between Big Data and Traditional Data
Traditional data includes data such as documents, financial records, stock data, and personal files, which are relatively structured and mostly human generated. Big Data, by contrast, refers to the large volumes of data generated through a variety of sources, not just humans but also machines and processes. This data includes social media content (Twitter messages, Facebook likes/messages, YouTube videos, Pinterest pins, etc.), sensor data (iBeacons), RFID data, and scientific research data, numbering in the millions or billions of records and enormous in size.
Why Big Data Matters (What Is Triggering the Phenomenon)
In the last few years, businesses have witnessed rapid data growth, whether in the retail, logistics, financial, or health industries. Many factors are contributing to this growth. The total cost of application ownership is falling as application providers offer cloud/subscription-based solutions. The Internet is becoming accessible to more and more end customers through affordable smartphones and higher bandwidth (4G). Social media has become part of the mainstream of daily life, business, and politics. More and more objects are being connected to the Internet (popularly known as the Internet of Things), generating data at each stage of a business's supply chain and value chain. Data is generated not only by humans but also by machines, in the form of RFID feeds, medical records, sensor data, scientific data, service logs, web content, maps, GPS logs, and so on. Some data is generated so fast that there is not even time to store it before applying analytics to it.
This exponential growth of structured and unstructured data at ever higher speeds is forcing businesses to explore it rather than ignore it. To remain competitive, businesses have to leverage the data they can obtain within and outside the organization and use it in decision making, whether to understand customers, suppliers, employees, or internal and external processes.
How to Get a Handle on Analyzing Big Data
Getting a handle on big data analytics starts with analyzing the company's business environment and strategies, and with understanding the sources of data and their relevance to the business. It is also important to know how data-driven companies and other analytical competitors are exploiting Big Data to achieve strategic advantage. Understanding big data analytics involves not only the technical aspects of big data infrastructure, such as Hadoop, but also the logical aspects of analytics, such as data modeling and mining and their application to business decisions. Literature reviews and research alone will not give a company confidence that big data analytics is the way forward. Companies can instead start a pilot project focused on one strategic aspect of the business where Big Data analytics could add value to the decision-making process.
Hadoop and MapReduce
Hadoop is an open-source software framework that enables the processing of large data sets on distributed clusters of commodity servers. Its main advantage is scalability: a cluster can grow from one server to thousands, with the possibility of parallel processing. Hadoop has two main components: a) HDFS and b) MapReduce. HDFS, the Hadoop Distributed File System, spans all the nodes of a Hadoop cluster. Unlike an RDBMS, it is a schema-less architecture that can easily store different types of structured and unstructured data. Hadoop is also fault tolerant: data blocks are replicated across nodes, so if one node fails, the data can still be read from replicas on other nodes.
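The fault-tolerance idea can be sketched in a few lines of Python. This is a simplified simulation, not the real HDFS API: block and node names are invented for illustration, and the default replication factor of 3 is the only detail taken from HDFS itself.

```python
import random

# Simplified sketch of HDFS-style block replication. Each block is
# copied to several distinct nodes (HDFS defaults to 3 replicas),
# so losing one node does not lose the data.
REPLICATION = 3
nodes = {f"node{i}": set() for i in range(5)}  # 5 hypothetical data nodes

def store_block(block_id):
    """Place replicas of a block on distinct, randomly chosen nodes."""
    for node in random.sample(list(nodes), REPLICATION):
        nodes[node].add(block_id)

def read_block(block_id, failed):
    """Read from any live node holding a replica (simple failover)."""
    for node, blocks in nodes.items():
        if node not in failed and block_id in blocks:
            return node
    raise IOError("all replicas lost")

store_block("blk_001")
holders = [n for n, b in nodes.items() if "blk_001" in b]
# Simulate the first replica holder failing: the block is still readable.
print(read_block("blk_001", failed={holders[0]}))
```

The key design point is that recovery is a read-path concern: clients simply skip dead nodes and fetch from a surviving replica, while the real NameNode re-replicates under-replicated blocks in the background.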
MapReduce is at the core of Hadoop. It is a highly scalable, cluster-based data processing technology consisting of two main tasks, a) Map and b) Reduce, executed in sequence. The "map" task takes a set of data from a source and converts it into key/value pairs. The "reduce" task then takes the key/value pairs output by the "map" task and combines (reduces) those tuples into a smaller number of tuples. For example, suppose we want to collect Twitter data about a newly released song for sentiment analysis, using the keywords "I love it", "I like it", and "I hate it". The map task finds these keys and computes a value (count) for each data set stored on an HDFS node, and the reduce task combines the computations from all nodes into a final result (e.g., the final counts for these keys). In this way it becomes possible to process vast volumes of data, limited only by the scalability of the cluster.
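The sentiment example above can be sketched in plain Python. The tweets are made up for illustration, and both phases run locally in sequence; in real Hadoop the map tasks would run in parallel on each HDFS node and the reducers would merge their partial outputs.

```python
from collections import defaultdict

# Hypothetical sample tweets; the keywords mirror the example in the text.
tweets = [
    "Just heard the new song. I love it!",
    "Honestly, I hate it.",
    "I like it a lot",
    "I love it, playing it on repeat",
]

KEYWORDS = ["i love it", "i like it", "i hate it"]

def map_phase(tweet):
    """Map: emit a (keyword, 1) pair for each keyword found in the tweet."""
    text = tweet.lower()
    return [(kw, 1) for kw in KEYWORDS if kw in text]

def reduce_phase(pairs):
    """Reduce: sum the counts for each keyword across all mapped pairs."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

mapped = [pair for tweet in tweets for pair in map_phase(tweet)]
result = reduce_phase(mapped)
print(result)  # counts per keyword, e.g. 2 for "i love it"
```

The split into a map that emits key/value pairs and a reduce that aggregates by key is exactly what lets Hadoop distribute the work: each node can map its own slice of the data independently before the framework shuffles pairs to reducers.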
Successful Users and Leaders of Big Data
Google is the main inventor, leader, and user of Big Data technologies such as MapReduce, BigTable, and BigQuery. It used to store data in a legacy database management system, but it became impossible to capture the whole World Wide Web and re-index it frequently enough to keep the search indexes up to date. Google needed a solution that was scalable and efficient and could handle large volumes of data. Solutions originally relevant to data-driven companies like Google, Netflix, and Amazon are now becoming mainstream in other industries as well.
Facebook is one of the biggest users of Hadoop, claiming to have used HDFS for 100 petabytes of disk space as of July 2012. The site stores billions of photos, and it has been using a number of tools such as Hadoop, Hive, and HBase for big data analytics.
Pinterest stores billions of pins (images, websites, etc.), and it has been using Hadoop to run various analyses on the data generated by its users and to provide a personalized experience through recommendations and suggestions.
IBM, SAS, SAP, Intel and Oracle, Cloudera, Hortonworks
Companies like Cloudera and Hortonworks are bringing disruptive innovation to big data consulting and solutions, having been founded specifically to capitalize on the Big Data paradigm shift. At the same time, existing database and analytics giants like IBM, SAS, SAP, and Oracle are transforming themselves by offering big data solutions coupled with their existing applications.
Data is becoming the energy of the 21st century, a vitally important resource for every organization. The most important challenge will be to implement a sustainable, affordable, stable solution that can draw insight and knowledge out of the mountains of data generated by humans and machines every second. Big Data solutions like Hadoop are being increasingly adopted by industries to exploit their structured and unstructured data, creating value through the cycle of data, information, insight, knowledge, and intelligence, and developing the strategic capability of an analytical competitor. Despite the challenges, more and more organizations are expected to join this bandwagon in the coming years.