March 07, 2019

Data Democratization




At a fundamental level, typical business management needs answers to the Five W's and one H.
Let's take the example of an e-commerce company.


  1. Who is responsible for revenue growth?
  2. What was the most popular product last month?
  3. When will the new features be launched?
  4. Where are we in terms of revenue growth?
  5. Why has website traffic growth been low since last quarter?
  6. How do we turn the revenue trend from negative to positive?



Dashboards are great at presenting answers to the "What", "When", and "Where" questions.
Unfortunately, the "Why" and "How" questions are often much harder to tackle. They typically require a data investigation: a deep dive approached from a variety of angles that were not planned for in advance. How do we achieve this without hiring a massive data staff or expecting every business employee to become a data scientist?



One must-have is to make the data easily accessible to those who need it. Gone are the days when access should require a long business justification and second-line manager approval. We need to lower the barriers to accessing standard, non-sensitive business data and provide self-serve data analysis rather than facilitated data analysis. We need to build a true data democracy that enables SMEs who are not data experts to perform self-service analytics.
If data is the new “oil”, its benefits should be shared freely with all types of users in an understandable format. This data can then be further refined or consumed to make data-driven decisions.


Data democratization means that information in digital format is accessible to the average end user, with no gatekeepers creating a bottleneck at the gateway to the data. The goal is to allow non-specialists to gather and analyze data so that they can expedite decision-making and uncover opportunities for the organization: anybody can use data at any time to make decisions, with no barriers to access or understanding.

Data democratization is a process, and it has to be embedded and called out in the regular big data development life cycle. It involves people, process, and technology working together to arrive at innovative, valuable business decisions from the insights gained. A data lake, as a technology or platform, helps implement data democracy more efficiently and effectively.



A data lake is a raw collection of data; users only need to worry about the format at the time of access.
The enterprise data lake is the core and the future of the modern data warehouse architecture, complemented by metadata management, master data management, data governance, and security across the layers. A data lake allows data to be stored in its native form, which broadens the range of uses and increases flexibility and adaptability as requirements change.
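
The idea of worrying about the format only at the time of access is often called schema-on-read. The following is a minimal sketch of that idea using only the Python standard library, not a description of any particular data lake product. The sample records, the field names (order_id, product, amount), and the read_orders helper are all hypothetical and exist only for illustration.

```python
import csv
import io
import json

# Raw records kept exactly as they arrived: a CSV export and a JSON event log.
# Both payloads are made-up sample data (assumption, not from a real system).
RAW_ORDERS_CSV = "order_id,product,amount\n1,laptop,1200.50\n2,phone,650\n"
RAW_EVENTS_JSON = '{"order_id": "3", "product": "tablet", "amount": "300.25"}\n'


def read_orders(raw_csv, raw_json):
    """Apply a schema (int id, str product, float amount) only when reading."""
    rows = list(csv.DictReader(io.StringIO(raw_csv)))
    rows += [json.loads(line) for line in raw_json.splitlines() if line.strip()]
    return [
        {
            "order_id": int(r["order_id"]),
            "product": r["product"],
            "amount": float(r["amount"]),
        }
        for r in rows
    ]


orders = read_orders(RAW_ORDERS_CSV, RAW_EVENTS_JSON)
total = sum(o["amount"] for o in orders)
print(total)  # prints 2150.75
```

Nothing is converted or validated when the raw text is stored; types are imposed only inside read_orders. A different consumer could read the same raw data with a different schema, which is the flexibility the data lake approach is meant to offer.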
