December 01, 2021

Data Engineering Analogy

Data is the new oildata is the economic  fuel of the future. Like oil, data must be refined, cleansed.

Data is the new energy. You consume data irrespective of from which source it came and enriched similar to energy , it come from windmills stored in power station and consumed by end user and pay per use.

Data is the new sunlight. Like solar rays, Data will be everywhere and underlie everything



November 10, 2021

Always On Availability Group - Offloading Reporting Approach

An availability group supports a failover environment for a discrete set of user databases, known as availability databases, that fail over together. An availability group supports a set of primary databases and corresponding secondary databases which continuously back up databases and their transaction logs on a regular basis.

 The Source database sends transaction log records to secondary database and then applies them to secondary database. Data synchronization occurs between the primary database and its corresponding secondary database so that changes are applied in real-time (or near-real-time) while users are still in the database. The secondary databases server must have the same specs as the primary databases server. AlwaysOn is implemented on a per-database basis. We don't need to add all databases on source database server. Only those databases which need for real time reporting need to configure for AlwaysOn.

This secondary database can be used for real-time reporting purpose. Also, if we would like to prevent read-only workloads from running on the primary database, we can use secondary database to allow only read-write access for running complex queries.

 Benefits

  1. Real time Reporting

  2. Enhanced Performance

  3. Complex queries can run without compromising database performance.

  4. High Availability (HA) with manual failover 

Infrastructure and Software needed for AlwaysOn

  1. Windows Server 2012 or later versions.

  2. SQL Server 2014 Enterprise edition or later versions

  3. Since we can run a query on a secondary replica, all SQL servers need to have license.

  4. Both SQL & OS Editions, Versions should be at same level on all participating nodes.

  5. All replicas in your AlwaysOn group must be in the same windows domain. They must be able to communicate with each other.

  6. SQL Server port must be opened at firewall level for communication between replicas.

  7. Ensure that each computer is a Node in a Windows Server Failover Clustering (WSFC)

Cost and Budget

For setting AlwaysOn, a minimum of 2 SQL servers with Enterprise edition (Source database and Secondary database)  are required.

 Analytics workload or read-only queries running on the primary replica can be run on the secondary replica without requiring any changes.




Comparison: AlwaysOn Vs Database Mirroring



AlwaysOn

Database Mirroring

Reporting

Readable replicas allow real time reporting

The mirror database is always in a state of recovery, which essentially means it can’t be used while mirroring is active. You can create a snapshot of the mirror database to use for reporting, but the snapshot is a point-in-time copy making it unsuitable for real time reporting purposes

Read-only routing

Supports the re-routing of read-intent connection requests to a readable secondary replica. 

This feature in not available in Database Mirroring.

Performance Considerations

Performing highly CPU, I/O-intensive read-only workloads on the secondary databases, it will improve the database performance of Source database.

Synchronous mirroring requires a 2-phase commit to take place which could introduce delays on the principal.

Additional H/W

Integrated with Windows Server Failover Clustering (WSFC)

Database mirroring doesn’t require WSFC but if you want to use it in High Safety mode for automatic failover, then you need to use a third SQL Server system as a witness.

Feature support

Only available with SQL Server Enterprise edition

Available with both SQL Server Enterprise and Standard edition. Synchronous mirroring is available only on Enterprise edition.

Risks

All active workloads must be licensed and requires additional cost of SQL server license.

Database Mirroring is available in SQL2012 and SQL2014 but will be deprecated in future SQL Server releases.

Schema or data changes

You cannot make schema or data changes on the secondary databases which limits creating additional indexes on secondary databases

The mirror database is always in a state of recovery, which essentially means it can’t be used while mirroring is active

Benefits

Provides High availability for SQL Server databases. Capacity of replicate databases to facilitate read only access which makes it possible for them to be used for multiple purposes at the same time. You can use a secondary database for reporting purposes.

Provides a high availability of the database and Do not need any singular hardware.

 


Understanding Relational database scaling

 Many applications today still rely on traditional relational databases like MS SQL, Oracle, MySQL or PostgreSQL for data storage and data processing. With the growing amount of data and new workloads that are made with these database systems, achieving scalability is a huge challenge for relational databases.


In this article, I will discuss addressing the need for database scalability, techniques to scale a relational database.


      

  1.  Addressing the need for database scalability:


  • You’ve reached the capacity number of users that your application is able to handle.

  • Application is beginning to slow down and error out. 

  • Network requests start timing out

  • Database queries are taking a while to execute

. 

  1.  Database scaling solutions:


  • Cache database queries in Memory

  • Replication 

  • Database sharding

  • Denormalization

  • SQL tuning


  1.  Horizontal scalability Vs Vertical scalability:


  • Vertical Scaling -  This process involves adding more power such as CPU and disk power to enhance your storage process.

  • Adv of Vertical Scaling -  Simple and no code change

  • Disadv of Vertical Scaling -  Expensive and Difficult to perform when server exceeds maximum load

  • Horizontal Scaling -  This process involves increasing  the number of machines and dividing the data set and distributing the data over them and improving both read and write performance.

  • Adding read replicas to handle Read-Heavy workloads. Reading from the cache before hitting the primary DB to reduce database load.

  • Adv of Horizontal Scaling -  Cheap and better performance

  • Disadv of Horizontal Scaling -  Making joins is difficult, as it may involve cross-server communication




  1.  Designing the database architecture to achieve scalability:

  • Decentralized data management architecture

  • Each individual microservice have its own separate database

  • Deploy and scale your microservices independently.


  1.  Recent scalability enhancements  in RDBMS

  • Connection Pooling

  • Command Query Responsibility Segregation (CQRS)

  • Multi Primary Replication

  • Data Centre Wise Partition


  1.  Implementing scaling solutions introduces the following complexities:

  • Adding new features takes longer

  • The system becomes more complex with more pieces and variables involved

  • Code can be more difficult to test

  • Finding and resolving bugs becomes harder

September 17, 2021

Azure SQL Database - Encryption at rest




What is encryption at rest?

Encryption is the secure encoding of data used to protect confidentiality of data.

The purpose of encryption at rest

Encryption at rest provides data protection for stored data (at rest). Encryption at rest is designed to prevent the attacker from accessing the unencrypted data by ensuring the data is encrypted when on disk. If an attacker obtains a hard drive with encrypted data but not the encryption keys, the attacker must defeat the encryption to read the data. 

Azure SQL Database

Azure SQL Database currently supports encryption at rest for Microsoft-managed service side and client-side encryption scenarios.

Support for server encryption is currently provided through the SQL feature called Transparent Data Encryption. Once an Azure SQL Database customer enables TDE key are automatically created and managed for them. Encryption at rest can be enabled at the database and server levels. As of June 2017, Transparent Data Encryption (TDE) is enabled by default on newly created databases. Azure SQL Database supports RSA 2048-bit customer-managed keys in Azure Key Vault. 

Client-side encryption of Azure SQL Database data is supported through the Always Encrypted feature. Always Encrypted uses a key that created and stored by the client. Customers can store the master key in a Windows certificate store, Azure Key Vault, or a local Hardware Security Module. Using SQL Server Management Studio, SQL users choose what key they'd like to use to encrypt which column.

For Azure SQL Managed Instance, the TDE protector is set at the instance level and is inherited by all encrypted databases on that instance.

To provide Azure SQL customers with two layers of encryption of data at rest, infrastructure encryption (using AES-256 encryption algorithm) with platform managed keys is being rolled out. This provides an addition layer of encryption at rest along with TDE with customer-managed keys, which is already available. For Azure SQL Database and Managed Instance, all databases, including the master database and other system databases, will be encrypted when infrastructure encryption is turned on.

Transparent data encryption (TDE) helps protect Azure SQL Database, Azure SQL Managed Instance, and Azure Synapse Analytics against the threat of malicious offline activity by encrypting data at rest. It performs real-time encryption and decryption of the database, associated backups, and transaction log files at rest without requiring changes to the application. By default, TDE is enabled for all newly deployed Azure SQL Databases and must be manually enabled for older databases of Azure SQL Database. For Azure SQL Managed Instance, TDE is enabled at the instance level and newly created databases. TDE must be manually enabled for Azure Synapse Analytics.


TDE performs real-time I/O encryption and decryption of the data at the page level. Each page is decrypted when it's read into memory and then encrypted before being written to disk. TDE encrypts the storage of an entire database by using a symmetric key called the Database Encryption Key (DEK). On database startup, the encrypted DEK is decrypted and then used for decryption and re-encryption of the database files in the SQL Server database engine process. DEK is protected by the TDE protector. TDE protector is either a service-managed certificate (service-managed transparent data encryption) or an asymmetric key stored in Azure Key Vault (customer-managed transparent data encryption).

How to check if existing database is Encrypted?

Below query returns information about the encryption state of a database and its associated database encryption keys. 

select db_name(database_id),key_algorithm,encryption_state_desc 

from sys.dm_database_encryption_keys

Creating DataFrames from CSV in Apache Spark

 from pyspark.sql import SparkSession spark = SparkSession.builder.appName("CSV Example").getOrCreate() sc = spark.sparkContext Sp...