November 21, 2022

GridFS - Large Document Storage in MongoDB

 The maximum BSON document size is 16 megabytes.

To store documents larger than 16 MB, we need to use GridFS.

GridFS is a MongoDB file system abstraction that is used for storing and retrieving huge image, audio, or video files which exceed the BSON-document size limit of 16 MB.

Instead of storing a file in a single document, GridFS divides the file into parts.

GridFS takes a file and splits it into sections called chunks. By default, each chunk size is 255 KB (this is a configurable parameter).

GridFS creates two collections, the chunks collection and the files collection, and places them in a common bucket by prefixing each with the bucket name (the default is fs): fs.chunks and fs.files. The bucket is only created on the first read/write operation, if it does not already exist.

 The split chunks are stored as documents in the chunks collection, while the file metadata is saved in the files collection.
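
As a minimal sketch of this read/write flow using the Python driver (assuming pymongo is installed and a local mongod is running; the database name and file path are placeholders):

  # Minimal GridFS sketch (pymongo assumed installed; names/paths are placeholders)
  from pymongo import MongoClient
  import gridfs

  client = MongoClient("mongodb://localhost:27017")
  db = client["media"]                      # hypothetical database name
  fs = gridfs.GridFS(db)                    # default bucket "fs" -> fs.files / fs.chunks

  # put() splits the file into 255 KB chunks and writes metadata to fs.files
  with open("movie.mov", "rb") as f:        # hypothetical large file
      file_id = fs.put(f, filename="movie.mov")

  # get() reassembles the chunks transparently
  data = fs.get(file_id).read()
  print(len(data), "bytes read back")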

 



Use Cases:

  • Content management systems
  • Healthcare - A patient health record repository includes general information about the patient (name, address, insurance provider, etc.) along with various types of medical records (office visits, blood tests and labs, medical procedures, etc.)
  • Movie/Audio streaming - Reads the Movies/Audio directory and streams all the related .MOV/.wav files

Configuration and Migration Steps:

 1. Create indexes on the GridFS collections of the source cluster.

 2. Identify documents in the GridFS collections that are greater than 16 MB (a query sketch follows this list).

 3. Use MongoPush to copy all the documents identified in step 2.
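
A hedged sketch of step 2, assuming the default fs bucket and the standard length field (file size in bytes) that GridFS keeps in fs.files; the connection URI and database name are placeholders:

  # List GridFS files larger than 16 MB on the source cluster
  from pymongo import MongoClient

  SIXTEEN_MB = 16 * 1024 * 1024
  client = MongoClient("mongodb://source-cluster:27017")   # placeholder URI
  db = client["media"]                                     # hypothetical database name

  cursor = db["fs.files"].find({"length": {"$gt": SIXTEEN_MB}},
                               {"filename": 1, "length": 1})
  for doc in cursor:
      print(doc["_id"], doc.get("filename"), doc["length"])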

 

Limitation/drawback:

Do not use GridFS if you need to update the content of the entire file atomically.

Performance is slower than serving the file directly from the file system or from a web server.

 https://www.mongodb.com/docs/manual/core/gridfs/

 

November 20, 2022

Digital Personal Data Protection Bill 2022

On 18 Nov 2022, India’s Ministry of Electronics and Information Technology (MeitY) released the draft of the much-awaited Digital Personal Data Protection (DPDP) Bill, 2022, proposing a new comprehensive data privacy law that will mandate how companies handle the data of Indian citizens. It will apply to businesses operating in the country and to any entities processing the data of Indian citizens.

The purpose of this Act is to provide for the processing of digital personal data in a manner that recognizes both the right of individuals to protect their personal data and the need to process personal data for lawful purposes, and for matters connected therewith or incidental thereto.

The draft also proposes that companies use the data they have collected on users only for the purpose for which they originally obtained it. It also seeks accountability from firms, requiring them to ensure that they process users’ personal data only for the precise purpose for which it was collected.

India Data Story - Aadhaar Card

Aadhaar is a 12 digit individual identification number issued by the Unique Identification Authority of India on behalf of the Government of India. The number serves as a proof of identity and address, anywhere in India.

The UIDAI collects basic data fields in order to establish identity: this includes name, date of birth, gender, and address; a parent's or guardian's name is essential for children but not for others, and mobile number and email ID are optional. The idea initially behind Aadhaar was to use it mainly for social welfare programs: to identify leakages, to identify ghost beneficiaries and weed them out, and to make social welfare schemes more efficient.

Over time, its use expanded to purposes beyond social welfare, such as completing know-your-customer (KYC) norms for telephone connections, linking your PAN number to your bank accounts, and so on.

Now almost all companies and small businesses collect data: real estate companies, hospitals, banks, insurance companies, auto dealerships, and marketing agents. They collect it in different ways, e.g. restaurants collect your phone number so that they can give you a goodie on your birthday, while banks and insurance companies collect data to enroll customers, etc.

Why is the DPDP Bill needed?

Data is the oil of the digital age. Companies gather data about a person’s online behavior: what people are doing, buying, eating, and so on. This data now powers the new age of AI. But it is now time to protect individuals’ personal data while ensuring that it is processed only for lawful purposes.

The road ahead - 

 Data Protection Strategy:

  • Appoint a Data Protection Officer who shall represent the Significant Data Fiduciary under the provisions of this Act and be based in India.
  • Data Encryption in Transit and Data Encryption at Rest strategy 
  • Identify the privacy requirements that would apply to each context, like banking, finance, and health records
  • Design backend systems for actually collecting user consent
  • Create a categorization for different types of data - personal data, sensitive personal data
  • Design an access system and policy for personal data security
  • Create Standard Operating Procedures for the execution of personal data security activities

 Data Protection Officer

A data protection officer is responsible for overseeing an organization’s data protection strategy and implementation.

  

Data Protection Implementation:

  •   Implement Data Encryption in Transit and Data Encryption at Rest (a minimal sketch follows this list)
  •   Encrypt all personal data after it is processed
  •   Implement Data Isolation and Protection
  •   Get your consent forms in order
  •   Implement granular opt-in
  •   Make sure users can easily withdraw their consent
  •   Implement DND (Do Not Disturb) and DNC (Do Not Call) systems
  •   Erase unsubscribed user data
  •   Implement Data Lineage
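
As a minimal sketch of the encryption-at-rest bullet above, using the cryptography package's Fernet recipe (the field value is illustrative only, and a real deployment would load the key from a KMS or secrets manager rather than generating it inline):

  # Symmetric encryption of a personal-data field at rest (pip install cryptography)
  from cryptography.fernet import Fernet

  key = Fernet.generate_key()               # in practice: fetch from a secrets manager / KMS
  f = Fernet(key)

  email = "user@example.com"                # hypothetical personal data
  token = f.encrypt(email.encode("utf-8"))  # store the ciphertext, not the plain value
  print(f.decrypt(token).decode("utf-8"))   # decrypt only when lawfully required
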
https://avishkarm.medium.com/digital-personal-data-protection-bill-2022-db2f5fd367cd



 


 


November 17, 2022

Data Engineering Glossary

 Chief Data Officer (CDO)

The CDO is a senior executive whose role is to establish the firm’s data and information governance strategy, controls, policy development, and an effective implementation plan to successfully create business value and showcase the benefits and ROI.

Data Program Manager

A Data Program Manager ensures that the program charter is in line with the organizational data strategy and roadmap for the future.

Data Protection Officer

“Data Protection Officer” means an individual appointed as such by a Significant Data Fiduciary under the provisions of this Act.

https://avishkarm.blogspot.com/2022/11/digital-personal-data-protection-bill.html

 Data Scientist

Data scientists perform research and tackle open-ended questions. A data scientist has domain expertise, which helps him or her create new algorithms and models that address questions or solve problems. The data scientist takes the data visualizations created by data analysts a step further, sifting through the data to identify weaknesses, trends, or opportunities for an organization. The data scientist role is critical for organizations looking to extract insight from information assets for “big data” initiatives and requires a broad combination of skills that may be fulfilled better as a team. 

Data Steward

Data stewardship is the management and oversight of corporate data by designated personnel who typically don’t “own” the data but who ensure adherence to data laws and internally established data governance policies. They act as trustees of data, are intimately knowledgeable with business process and data usage. Their area of responsibility addresses issues such as data quality, accessibility, usability, and security.

 

Database Architect

A data architect gathers requirements from business and technology teams, determines the high-level design, models business data and rules in a meaningful and consistent manner, picks the right data technology, and reviews database objects like tables, stored procedures, and the overall database design.

DBA (Database Administrator)

A DBA designs, implements, administers, and monitors data management systems and ensures their design, consistency, quality, and security. The DBA also performs data housekeeping activities like storage management, backup/restore, and performance optimization.


DataOps

DataOps (data operations) is an agile, process-oriented methodology for developing and delivering analytics. It brings together DevOps teams with data engineer and data scientist roles to provide the tools, processes, and organizational structures needed to support the data-focused enterprise. DataOps focuses on the collaborative development of data flows and the continuous use of data across the organization.

Data Engineer 

The data engineer moves data from operational systems into a data lake and writes the transforms that populate schemas in data warehouses and data marts.

Data Engineers are the individuals in an organization responsible for setting up the data infrastructure, overseeing the data processes, and building the data pipelines that convert raw data into consumable data products.

Data Analyst

The data analyst takes the data warehouses created by the data engineer and provides analytics to stakeholders. The data analyst creates visual representations of data to communicate information in a way that leads to insights either on an ongoing basis or by responding to ad-hoc questions. The data analyst serves as a gatekeeper for an organization’s data so stakeholders can understand data and use it to make strategic business decisions. Data analysts draw conclusions from data to describe, predict, and improve business performance. They form the core of any analytics team and tend to be generalists versed in the methods of mathematical and statistical analysis.

Data Principal

“Data Principal” means the individual to whom the personal data relates and, where such individual is a child, includes the parents or lawful guardian of such a child.

Data Processor

“Data Processor” means any person who processes personal data on behalf of a Data Fiduciary.

Data Lakehouse

A data lakehouse is a new, open data management architecture that combines the flexibility, cost-efficiency, and scale of data lakes with the data management and ACID transactions of data warehouses, enabling business intelligence (BI) and machine learning (ML) on all data.

Data Lake

A data lake is a storage repository that holds a vast amount of raw data in its native format, including structured, semi-structured, and unstructured data

Data Warehouse

A data warehouse is a data storage technology that brings together data from multiple sources into a single system. It serves as a centralized data hub holding large amounts of historical data that users can query for the purpose of analytics.

Data Model 

Data models are visual representations of an enterprise’s data elements and the connections between them. By helping to define and structure data in the context of relevant business processes, models support the development of effective information systems. They enable business and technical resources to collaboratively decide how data will be stored, accessed, shared, updated and leveraged across an organization.

Data hub

Data hubs are data stores that act as an integration point in a hub-and-spoke architecture. They physically move and integrate multi-structured data and store it in an underlying database.

Data mesh

A data mesh is a new approach to designing data architectures. It takes a decentralized approach to data storage and management, having individual business domains retain ownership over their datasets rather than flowing all of an organization’s data into a centrally owned data lake.

 

Data fabric

A data fabric is an architectural design that enables connection to data regardless of where it is stored. This makes it possible to store data in separate “siloed” data lakes or data warehouses, each with localized control and governance, while still allowing users to perform queries across the entirety of an organization’s data assets. The idea of a data fabric is to balance the pros and cons of centralized vs. decentralized data architectures, making it possible to have strong data protection and security without sacrificing data visibility or insights. Data fabrics work by unifying data assets at the compute level, rather than the storage level. In this architecture, data can flow from different sources to a unified app and be analyzed together without duplicating storage.

Data drift

Data drift refers to a change in data structure or meaning that can occur over time and cause machine learning models to break. It occurs frequently when ML models seek to describe continually changing (dynamic) circumstances or environments.

Data virtualization

Data virtualization involves creating virtual views of data stored in existing databases. The physical data doesn’t move but you can still get an integrated view of the data in the new virtual data layer. This is often called data federation (or virtual database), and the underlying databases are the federates.

Data Migration

Data migration is the process of moving data from one system to another. It is mostly used in the context of the extract/transform/load (ETL) process, where the extracted data goes through a series of preparation and transformation functions, after which it can be loaded into a target location.
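
For example, a toy extract/transform/load pass in pandas (pandas assumed installed; file names and columns are hypothetical):

  import pandas as pd

  df = pd.read_csv("legacy_customers.csv")             # extract from the source system
  df["email"] = df["email"].str.strip().str.lower()    # transform: basic cleanup
  df = df.dropna(subset=["customer_id"])               # transform: drop unusable rows
  df.to_csv("target_customers.csv", index=False)       # load into the target location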

Data Democratization

Data democratization means that everybody has access to data, there are no gatekeepers creating a bottleneck at the gateway to the data, and people are educated on how to work with data, regardless of their technical background.

Data Science

Data science combines math and statistics, specialized programming, advanced analytics, artificial intelligence (AI), and machine learning with specific subject matter expertise to uncover actionable insights hidden in an organization’s data. These insights can be used to guide decision making and strategic planning.

Data Visualization

Data visualization is the graphical representation of information and data. By using visual elements like charts, graphs, and maps, data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data. Additionally, it provides an excellent way for employees or business owners to present data to non-technical audiences without confusion.

Data dictionary

A data dictionary is a collection of data elements in a database or data model with detailed descriptions of their format, relationships, meaning, source, and usage across an organization.

Enterprise Data Management (EDM)

Enterprise data management (EDM) refers to a set of processes, practices, and activities focused on data accuracy, quality, security, availability, and good governance.

Master Data Management(MDM)

Master data is all the data critical to the operation of a business. This data is usually shared across the enterprise, and multiple departments and personnel depend on it for decision-making.

Master data management (MDM) involves creating a single master record for each person, place, or thing in a business, from across internal and external data sources and applications. This information has been de-duplicated, reconciled and enriched, becoming a consistent, reliable source. Once created, this master data serves as a trusted view of business-critical data that can be managed and shared across the business to promote accurate reporting, reduce data errors, remove redundancy, and help workers make better-informed business decisions.

Metadata

Metadata is simply data about data. It means it is a description and context of the data. It helps to organize, find and understand data.

Data Modernization

Data modernization is the process of transferring data to modern cloud-based databases from outdated or siloed legacy databases, including structured and unstructured data. In that sense, data modernization is synonymous with cloud migration.

Data Architecture

Data architecture translates business needs into data and system requirements and seeks to manage data and its flow through the enterprise. A data architecture describes how data is managed, from collection through to transformation, distribution, and consumption. It sets the blueprint for data and the way it flows through data storage systems, and it defines the respective data model and the underlying data structures that support it. Modern data architectures often leverage cloud platforms to manage and process data.

Data quality 

Data quality is an integral part of data governance that ensures that your organization’s data is fit for purpose. It refers to the planning, implementation, and control of activities that apply quality management techniques to data, in order to assure it is fit for consumption and meets the needs of data consumers.

Data observability

Data observability is the ability to understand, diagnose, and manage data health across multiple IT tools throughout the data lifecycle. A data observability platform helps organizations to discover, triage, and resolve real-time data issues using telemetry data like logs, metrics, and traces.

Data Lineage

Data lineage uncovers the life cycle of data: it aims to track the complete data flow, from start to finish over time, providing a clear way of understanding, recording, and visualizing data as it flows from data sources to consumption. This includes all the transformations the data underwent along the way: how the data was transformed, what changed, and why.

Data Privacy

Data privacy is a guideline for how data should be collected or handled, based on its sensitivity and importance. Data privacy concerns apply to all sensitive information that organizations handle, including that of customers, shareholders, and employees. Often, this information plays a vital role in business operations, development, and finances. Data privacy helps ensure that sensitive data is only accessible to approved parties.

 

Data protection

Data protection is a set of strategies and processes you can use to secure the privacy, availability, and integrity of your data. A data protection strategy is vital for any organization that collects, handles, or stores sensitive data. A successful strategy can help prevent data loss, theft, or corruption and can help minimize damage caused in the event of a breach or disaster.

 

Data Security

Data security is the practice of protecting digital information from unauthorized access, corruption, or theft throughout its entire lifecycle. It’s a concept that encompasses every aspect of information security from the physical security of hardware and storage devices to administrative and access controls, as well as the logical security of software applications. It also includes organizational policies and procedures.

Data Encryption

Data encryption is the process of converting readable data into an encoded form so that it can only be read or processed after being decrypted, protecting it both at rest and in transit.

Data Program Management (DPM)

Data Program Management (DPM) is the intelligent application of data management tools, technologies, and processes to improve the usefulness of an organization’s data.

Data blending

Data blending is a process that allows the users to quickly get value from multiple data sources by helping them see patterns.

Data Governance

Data governance is the collection of policies, processes and standards that define how data assets can be used within an organization and who has authority over them. Governance dictates who can use what data and in what way.

 

Data catalog 

A data catalog is a comprehensive collection of an organization’s data assets, which are compiled to make it easier for professionals across the organization to locate the data they need. 

 

Data Modeling 

Data modeling is the process of analyzing and defining all the different data your business collects and produces, as well as the relationships between those bits of data. Data modeling concepts create visual representations of data as it’s used at your business, and the process itself is an exercise in understanding and clarifying your data requirements.

Data munging

Data munging is the process of manual data cleansing prior to analysis. It is a time-consuming process that often gets in the way of extracting true value and potential from data.

Data pipeline

A data pipeline is a sequence of steps that collect, process, and move data between sources for storage, analytics, machine learning, or other uses. For example, data pipelines are often used to send data from applications to storage devices like data warehouses or data lakes.

Data profiling

Data profiling is the process of evaluating the contents and quality of data. It is used to identify data quality issues at the start of a data project and define what data transformation steps may be needed to bring the dataset into a ready-to-use state.
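
For example, a quick profiling pass with pandas (the dataset name and columns are hypothetical):

  import pandas as pd

  df = pd.read_csv("orders.csv")            # hypothetical dataset
  print(df.dtypes)                          # column types
  print(df.isna().sum())                    # missing values per column
  print(df.describe(include="all"))         # summary statistics for every column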

 

Ad-hoc query

An ad-hoc query is a single-use query generally used to answer “on-the-fly” business questions for which there are no pre-written queries or standard procedures.

Batch processing

Batch processing refers to the scheduling and processing of large volumes of data simultaneously, generally at periods of time when computing resources are experiencing low demand. Batch jobs are typically repetitive in nature and are often scheduled (automated) to occur at set intervals.

Data cleansing

Data cleansing, data cleaning or data scrubbing is the first step in the overall data preparation process. It is the process of analyzing, identifying and correcting messy, raw data. Data cleaning involves filling in missing values and identifying and fixing errors.

Data Wrangling

Data wrangling is the process of cleaning, structuring and enriching raw data into a desired format for better decision making in less time.

Data Masking

Data masking is the process of replacing sensitive information copied from production databases to non-production test databases with realistic, but scrubbed, data based on masking rules. Data masking is ideal for virtually any situation when confidential or regulated data needs to be shared with non-production users. These non-production users need to access some of the original data, but do not need to see every column of every table, especially when the information is protected by government regulations.
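
A small sketch of such masking rules applied before a table is copied to a test database (pandas assumed; file and column names are hypothetical):

  import pandas as pd

  df = pd.read_csv("prod_customers.csv")                     # production extract
  df["phone"] = df["phone"].astype(str).str[:-4] + "XXXX"    # keep the format, hide the last 4 digits
  df["email"] = ["user" + str(i) + "@example.com"            # realistic but fake addresses
                 for i in range(len(df))]
  df.to_csv("test_customers.csv", index=False)               # scrubbed copy for non-production use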

Data mining

Data mining is the process of finding anomalies, patterns and correlations within large data sets to predict outcomes. Using a broad range of techniques, you can use this information to increase revenues, cut costs, improve customer relationships, reduce risks and more.

 

Data integration

Data integration is the process of combining data from different sources into a single, unified view. Integration begins with the ingestion process, and includes steps such as cleansing, ETL mapping, and transformation.

SQL Performance Tuning

SQL tuning is the process of improving SQL queries to accelerate your server’s performance. Its general purpose is to reduce the amount of time it takes a user to receive a result after issuing a query, and to reduce the amount of resources used to process a query. You can sometimes create the same desired result set with a faster-running query; the key is learning to identify when your queries can be improved, and how to improve them.

 

Data Analytics

Data analytics analyzes internal and external data to create value and actionable insights.

Data Fiduciary

 “Data Fiduciary” means any person who alone or in conjunction with other persons determines the purpose and means of processing of personal data.

 Cloud data warehouse

A cloud data warehouse is a database that is managed as a service and delivered by a third party, such as Google Cloud Platform (GCP), Amazon Web Services (AWS), or Microsoft Azure. Cloud data architectures are distinct from on-premise data architectures, where organizations manage their own physical database infrastructure on their own premises.

Big Data

Big data is a term that describes large, hard-to-manage volumes of data – both structured and unstructured – that inundate businesses on a day-to-day basis. But it’s not just the type or amount of data that’s important, it’s what organizations do with the data that matters. Big data can be analyzed for insights that improve decisions and give confidence for making strategic business moves.

NoSQL

NoSQL databases (aka "not only SQL") are non-tabular databases and store data differently than relational tables. NoSQL databases come in a variety of types based on their data model. The main types are document, key-value, wide-column, and graph. They provide flexible schemas and scale easily with large amounts of data and high user loads. NoSQL databases are built from the ground up to store and process vast amounts of data at scale and support a growing number of modern businesses.
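
For instance, a tiny document-model sketch with pymongo (the connection string, database, and collection names are hypothetical):

  from pymongo import MongoClient

  col = MongoClient("mongodb://localhost:27017")["shop"]["products"]
  col.insert_one({"name": "lamp", "price": 20, "tags": ["home", "light"]})
  col.insert_one({"name": "ebook", "price": 5, "formats": ["pdf", "epub"]})  # different fields, same collection
  print(col.find_one({"tags": "home"}))    # query by a field inside the flexible schema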

Business Intelligence

Business intelligence (BI) refers to capabilities that enable organizations to make better decisions, take informed actions, and implement more-efficient business processes. BI keeps your organization in the know, and success depends in a large part on knowing the who, what, where, when, why, and how of the market. Business intelligence tools analyze historical and current data and present findings in intuitive visual formats.

Data

"Data” means a representation of information, facts, concepts, opinions or instructions in a manner suitable for communication, interpretation or processing by humans or by automated means.

Personal data

“Personal data” means any data about an individual who is identifiable by or in relation to such data;

Personal data breach

"Personal data breach" means any unauthorised processing of personal data or accidental disclosure, acquisition, sharing, use, alteration, destruction of or loss of access to personal data, that compromises the confidentiality, integrity or availability of personal data. 

 

Creating DataFrames from CSV in Apache Spark

 from pyspark.sql import SparkSession
 spark = SparkSession.builder.appName("CSV Example").getOrCreate()
 sc = spark.sparkContext
 ...
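
A minimal, self-contained sketch of creating a DataFrame from a CSV file, with a placeholder path and the common header/schema-inference options:

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("CSV Example").getOrCreate()
  df = spark.read.csv("data/people.csv", header=True, inferSchema=True)   # placeholder path
  df.printSchema()     # inspect the inferred schema
  df.show(5)           # preview the first rows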