March 16, 2024

Creating DataFrames in Apache Spark

 Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. It offers high-level APIs in languages such as Scala, Java, Python, and R, making it accessible to a wide range of developers.


In Apache Spark, SparkSession is the entry point for working with structured data, introduced in Spark 2.0. It unifies the functionality previously spread across SQLContext, HiveContext, and SparkContext into a single interface for SQL, DataFrame, and Dataset operations.


Creating DataFrames: 

SparkSession allows you to create DataFrames from various data sources such as JSON, CSV, Parquet, JDBC, Avro, and more. It provides methods like read and readStream to read data into DataFrames and Datasets.
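For illustration, here is a minimal sketch of that same read interface applied to CSV and Parquet sources; the file paths are placeholders, not files that ship with Spark:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Readers Example").getOrCreate()

# CSV: treat the first line as a header and let Spark infer column types
df_csv = spark.read.csv("d:/spark/data/people.csv", header=True, inferSchema=True)

# Parquet: the schema is stored in the file itself
df_parquet = spark.read.parquet("d:/spark/data/people.parquet")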


Create DataFrames from JSON data sources using PySpark:


from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder \
    .appName("JSON Example") \
    .getOrCreate()

# Define the path to the JSON file
json_file_path = "d:/spark/examples/src/main/resources/people.json"

# Create a DataFrame from JSON
people_df = spark.read.json(json_file_path)

# Show the schema of the DataFrame
people_df.printSchema()

# Show the contents of the DataFrame
people_df.show()


>>> people_df.show()
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

>>> # Register the DataFrame as a SQL temporary view
>>> people_df.createOrReplaceTempView("people")
>>> sqlDF = spark.sql("SELECT * FROM people")
>>> sqlDF.show()
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

# Create a DataFrame from a TEXT file
>>> path = "d:/spark/examples/src/main/resources/people.txt"
>>>
>>> dftext = spark.read.text(path)
>>> dftext.show()
+-----------+
|      value|
+-----------+
|Michael, 29|
|   Andy, 30|
| Justin, 19|
+-----------+
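Because spark.read.text returns a single string column named value, the lines usually need to be parsed before analysis. Here is a minimal sketch, assuming the comma-separated "name, age" layout shown above:

from pyspark.sql import SparkSession
from pyspark.sql.functions import split, trim, col

spark = SparkSession.builder.appName("Text Example").getOrCreate()

dftext = spark.read.text("d:/spark/examples/src/main/resources/people.txt")

# Split "Michael, 29" into a string name column and an integer age column
parts = split(col("value"), ",")
people = dftext.select(
    trim(parts.getItem(0)).alias("name"),
    trim(parts.getItem(1)).cast("int").alias("age"),
)
people.show()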



March 12, 2024

Install and Configure Apache Cassandra on Windows

 


1. Introduction

Cassandra is a free and open-source, distributed, wide-column store, NoSQL database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. Cassandra offers support for clusters spanning multiple data centers, with asynchronous masterless replication allowing low-latency operations for all clients. Cassandra combines Amazon's Dynamo distributed storage and replication techniques with Google's Bigtable data and storage engine model.

Avinash Lakshman, one of the authors of Amazon's Dynamo, and Prashant Malik initially developed Cassandra at Facebook to power the Facebook inbox search feature. Facebook released Cassandra as an open-source project on Google Code in July 2008. In March 2009, it became an Apache Incubator project. On February 17, 2010, it graduated to a top-level project.

Facebook developers named their database after the Trojan mythological prophet Cassandra, with classical allusions to a curse on an oracle.

2. Installation and Configuration

2.1 Installing Cassandra

Source: https://phoenixnap.com/kb/install-cassandra-on-windows

Dependencies

  • Apache Cassandra requires Java 8 to run on a Windows system.
  • The Cassandra command-line shell (cqlsh) depends on Python 2.7 to work correctly.

To install Cassandra on Windows, you first need to:

  1. Download and install Java 8 and set the environment variables.
  2. Download and install Python 2.7 and set the environment variables.

If you already have these dependencies installed, check your versions of Java and Python. If you have Java 8 and Python 2.7, feel free to move on to the third section of this guide.

Step 1: Install Java 8 on Windows

1. Download Oracle JDK 8 (Java Development Kit): visit the official Oracle download page and download the Oracle JDK 8 software package.

2. Scroll down and locate the Java SE Development Kit 8u251 for Windows x64 download link. The Java 8 download starts automatically after signup.

 

Note: If you do not have an Oracle account, the website guides you through a quick signup process. Alternatively, you can download Java from a third-party website of your choosing. Always make sure to confirm the source of the download.

3. Once the download is complete, double-click the downloaded executable file. Select Next on the initial installation screen.

4. The following section allows you to select optional features and define the location of the installation folder. Accept the default settings and take note of the full path to the installation folder, C:\Program Files\Java\jdk1.8.0_251. Once you are ready to proceed with the installation, click Next.

5. The installation process can take several minutes. Select Close once the process is completed.

 Configure Environment Variables for Java 8

It is vital to configure the environment variables in Windows and define the correct path to the Java 8 installation folder.

1. Navigate to This PC > Properties.

 2. Select Advanced system settings.

3. Click the Environment Variables button.

 4. Select New in the System Variable section.

 5. Enter JAVA_HOME for the new variable name. Select the Variable value field and then the Browse Directory option.

 6. Navigate to This PC > Local Disk C: > Program Files > Java > jdk1.8.0_251 and select OK.

7. Once the correct path to the JDK 8 installation folder has been added to the JAVA_HOME system variable, click OK.

 

8. You have successfully added the JAVA_HOME system variable with the correct JDK 8 path to the variable list. Select OK in the main Environment Variables window to complete the process.

 Step 2: Install and Configure Python 2.7 on Windows

Users interact with the Cassandra database through the cqlsh command-line shell.

Python 2.7 must be installed for cqlsh to handle user requests properly.

Install Python 2.7 on Windows

1. Visit the Python official download page and select the Windows x64 version link.

 2. Define if you would like Python to be available to all users on this machine or just for your user account and select Next.

3. Specify and take note of the Python installation folder location. Feel free to leave the default location C:\Python27 by clicking Next.

 4. The following step allows you to customize the Python installation package. Select Next to continue the installation using the default settings.

5. The installation process takes a few moments. Once it is complete, select Finish to conclude the installation process.

 Edit Environment Variable for Python 2.7

1. Navigate to This PC > Properties.

 2. Select the Advanced system settings option.

 3. Click Environment Variables…

 4. Double-click on the existing Path system variable.

 5. Select New and then Browse to locate the Python installation folder quickly. Once you have confirmed that the path is correct, click OK.

 6. Add the Python 2.7 path to the Path system variable by selecting OK.

 

Step 3: Download and Set Up Apache Cassandra

Download and Extract the Cassandra tar.gz Folder

1. Visit the official Apache Cassandra Download page and select the version you would prefer to download. Currently, the latest available version is 3.11.6.

 2. Click the suggested Mirror download link to start the download process.


Note: It is always recommended to verify downloads originating from mirror sites. The instructions for using GPG or SHA-512 for verification are usually available on the official download page.

3. Unzip the compressed tar.gz folder using a compression tool such as 7-Zip or WinZip. In this example, the compressed folder was unzipped, and the contents placed in the C:\Cassandra\apache-cassandra-3.11.6 folder.

 Configure Environment Variables for Cassandra

Set up the environment variables for Cassandra to enable the database to interact with other applications and operate on Windows.

1. Go to This PC > Properties.

 2. Go to Advanced system settings.

3. Click the Environment Variables button.

 4. Add a completely new entry by selecting the New option.

5. Type CASSANDRA_HOME for the variable name, then for the variable value select the location of the unzipped Apache Cassandra folder.

Based on the previous steps, the location is C:\Cassandra\apache-cassandra-3.11.6. Once you have confirmed that the location is correct, click OK.

 

6. Double click on the Path variable.

7. Select New and then Browse. In this instance, you need to add the full path to the bin folder located within the Apache Cassandra folder, C:\Cassandra\apache-cassandra-3.11.6\bin.

8. Click OK and then OK again to save the edited variables.

 Step 4: Start Cassandra from Windows CMD

Navigate to the Cassandra bin folder. Start the Windows Command Prompt directly from within the bin folder by typing cmd in the address bar and pressing Enter.

 Type the following command to start the Cassandra server:

cassandra

The system proceeds to start the Cassandra Server.

 

Do not close the current cmd session.

Step 5: Access Cassandra cqlsh from Windows CMD

While the initial Command Prompt session is still running, open a new Command Prompt from the same bin folder. Enter the following command to access the Cassandra cqlsh shell:

cqlsh

You now have access to the Cassandra shell and can proceed to issue basic database commands to your Cassandra server.
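Beyond cqlsh, applications usually talk to Cassandra through a driver. Below is a minimal smoke-test sketch using the DataStax Python driver (install with pip install cassandra-driver); the demo keyspace and users table are illustrative, not part of the installation:

from cassandra.cluster import Cluster

# Connect to the local node started in the previous step
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

# Create an illustrative keyspace and table, then write and read a row
session.execute(
    "CREATE KEYSPACE IF NOT EXISTS demo "
    "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}"
)
session.set_keyspace("demo")
session.execute("CREATE TABLE IF NOT EXISTS users (id int PRIMARY KEY, name text)")
session.execute("INSERT INTO users (id, name) VALUES (%s, %s)", (1, "Ada"))

for row in session.execute("SELECT id, name FROM users"):
    print(row.id, row.name)

cluster.shutdown()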


You have successfully installed Cassandra on Windows.

 


March 07, 2024

How to Install Apache Spark on Microsoft Windows 10




Apache Spark is an open-source Big Data processing framework for large volumes of data from multiple sources. Spark is used in distributed computing for machine learning applications, data analytics, and graph-parallel processing on single-node machines or clusters.

This blog post will show you how to install Apache Spark on Windows 10 and test the installation.

Step 1: Install Java 8

1.1 Download Java from https://java.com/en/download/.

1.2 Install Java

1.3 Configure the JAVA_HOME environment variable to point to the Java JDK directory (for example, C:\Program Files\Java\<jdk_version>).




1.4 Check Java Version using Command Prompt.

java -version



Step 2: Install Python

2.1 Download Python 3.11 from https://www.python.org/

2.2 Install Python 3.11

2.3 Check the Python version using Command Prompt.

python --version



Step 3: Configure Hadoop

3.1 Download the winutils.exe file from https://github.com/cdarlint/winutils

3.2 Create folder C:\Hadoop\bin

3.3 Copy the winutils.exe file to C:\Hadoop\bin

3.4 Configure the HADOOP_HOME environment variable for the directory C:\Hadoop.




3.5 Add %HADOOP_HOME%\bin to the Path environment variable.




Step 4: Install Spark

4.1 Download Spark from https://spark.apache.org/downloads.html

4.2 Create a new folder named Spark

4.3 Extract the downloaded Spark archive to C:\Spark.

4.4 Configure the SPARK_HOME environment variable to point to the Apache Spark directory (for example, C:\Spark\spark-3.5.0-bin-hadoop3).

4.5 Add %SPARK_HOME%\bin to the Path environment variable.


Step 5:  Launch Spark with Command Prompt

5.1 Open Command Prompt

C:\Spark\spark-3.5.0-bin-hadoop3\bin\spark-shell






5.2 Browse to http://localhost:4040/.

You should see the Apache Spark shell Web UI.
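If you also plan to use Spark from Python, a quick smoke test is to build a small DataFrame from the PySpark shell (run pyspark from the same bin folder) or from a script; the rows below are arbitrary sample data:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SmokeTest").getOrCreate()

# Build a tiny DataFrame in memory and display it
df = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])
df.show()

spark.stop()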




January 24, 2024

Data Harmonization: Building a Single Version of Truth


What is data harmonization?
Data harmonization is the process of standardizing and integrating data from different fields, formats, and dimensions. Having access to clean, high-quality data allows you to analyze sales, marketing efforts, and other factors that contribute to your company’s success.

The data harmonization process:

  • Delivers data in the way that you analyze it internally (internal language), as well as the way outside vendors and partners need it (external language)
  • Creates hierarchies that allow for big-picture views, and ensures that hierarchies are consistent across data sources
  • Provides enough granularity to make decisions, but not so much detail that it’s difficult to sort through the data

Data harmonization utilizes master data to align data within sources (e.g., standardizing product names within a sales database) as well as across sources (e.g., reconciling social media data that may report weekly data as Sunday through Saturday vs. retail channel data that may report weekly sales as Monday through Sunday).

How Does Data Harmonization Benefit Businesses?
In simple terms, data harmonization increases the value and utilization of data. Data harmonization also makes it possible for organizations to transform fragmented and inaccurate data into workable information—creating new analyses, insights, and visualizations. This means that data harmonization helps the user reduce the time taken to access business intelligence, discover key insights, and detect early disruptions. It also significantly lowers the overall cost of complex data analysis and the cost of handling data in the long run. If an organization is spending less time scrambling to find the right source of data, then it can spend that time more effectively elsewhere, such as in growing the business and making a significant revenue impact.

Whether an organization has been around for several decades or is a recent start-up, it will inevitably gather a plethora of data. Along with it, there is the distinct possibility that the enormous array of information gathered from a wide variety of sources will have errors and misinformation. Besides this, the sheer volume of information collected over a company's lifespan can be unwieldy and overwhelming.

With data harmonization tools, this data can be a valuable mine of insights and business intelligence. Organizations can learn things about their customers, changing market forces, and even insights about competitors. The good news is that every company across the globe is mining and storing data to make smart business decisions and manage their customers. But first, to make sense of all that data, organizations need to harmonize it.

Most companies spend huge amounts of time and resources on commissioning surveys, conducting focus group sessions, and gathering information from the internet, news channels, and social media networks. All this information does not come together in one manageable, cohesive body but rather as a mish-mash of raw data. To make sense of it as a whole, it needs to be harmonized. Raw, unharmonized data isn’t suitable for business analysis. It often contains irrelevant pointers, misleading values, and duplicate statistics. However, when organizations use data harmonization techniques, they can standardize data and create a single source of verifiable information.


At its simplest, data harmonization enhances the quality and utility of business data. However, data harmonization also makes it possible for business users to transform data and create new data analyses and visualizations without IT involvement. Thus, data harmonization significantly decreases the time to create and access business intelligence insights, while also lowering the total cost of data analysis.

Who uses data harmonization?
Data harmonization technology is applicable in a variety of business functions, particularly sales and marketing. As a relatively new approach to data analysis and visualization, data harmonization is not yet widely used or understood.

Best practices in harmonization
Harmonization is typically a mix of automated steps (often using artificial intelligence) and manual efforts, with leading vendors automating 60 percent or more of the process. The goal is to use artificial intelligence as much as possible in order to reduce errors and shorten the time to insight. When evaluating harmonization solutions, look for vendors that:

  • Create data models that meet future plans as well as immediate needs
  • Offer deep industry and category expertise, which saves you time
  • Provide a no-code environment that lets data analysts harmonize the data directly


January 10, 2024

SQL Injection



SQL injection is an attack in which malicious code is inserted into strings that are later passed to an instance of SQL Server for parsing and execution. Any procedure that constructs SQL statements should be reviewed for injection vulnerabilities because SQL Server will execute all syntactically valid queries that it receives. Even parameterized data can be manipulated by a skilled and determined attacker.



How SQL Injection Works

The primary form of SQL injection consists of direct insertion of code into user-input variables that are concatenated with SQL commands and executed. A less direct attack injects malicious code into strings that are destined for storage in a table or as metadata. When the stored strings are subsequently concatenated into a dynamic SQL command, the malicious code is executed.

The injection process works by prematurely terminating a text string and appending a new command. Because the inserted command may have additional strings appended to it before it is executed, the malefactor terminates the injected string with a comment mark "--". Subsequent text is ignored at execution time.



The following script shows a simple SQL injection. The script builds an SQL query by concatenating hard-coded strings together with a string entered by the user:


JScript (classic ASP)


var custname;
custname = Request.Form("vcustname");
var sql = "select * from OrdersTable where custname = '" + custname + "'";

The user is prompted to enter a customer name. If the user enters "Robert", the query assembled by the script looks similar to the following:


SQL



SELECT * FROM OrdersTable WHERE custname = 'Robert'


However, assume that the user enters the following:


Robert'; drop table OrdersTable--  


In this case, the following query is assembled by the script:


SQL



SELECT * FROM OrdersTable WHERE custname = 'Robert';drop table OrdersTable--'



The semicolon (;) denotes the end of one query and the start of another. The double hyphen (--) indicates that the rest of the current line is a comment and should be ignored. If the modified code is syntactically correct, the server will execute it: SQL Server first selects all records in OrdersTable where custname = 'Robert', and then drops OrdersTable.


How to prevent SQL Injection?


You must validate all user input and carefully review code that executes constructed SQL commands in the server that you are using. 

Follow these coding best practices:


  • Validate the size, type, or content of the data that is received by your application

  • Never concatenate user input that is not validated. String concatenation is the primary point of entry for script injection.

  • Never build Transact-SQL statements directly from user input.

  • Stored procedures may be susceptible to SQL injection if they use unfiltered input. If you use stored procedures, you should use parameters as their input.

  • Use the Parameters collection with dynamic SQL (see the parameterized-query sketch after this list).

  • If you are using a LIKE clause, wildcard characters still must be escaped.

  • Filtering input may also be helpful in protecting against SQL injection by removing escape characters.

  • How will your application behave if an errant or malicious user enters a 10-megabyte MPEG file where your application expects a postal code?

  • How will your application behave if a DROP TABLE or DELETE FROM statement is embedded in a text field?

  • Test the size and data type of input and enforce appropriate limits. This can help prevent deliberate buffer overruns.

  • Test the content of string variables and accept only expected values. Reject entries that contain binary data, escape sequences, and comment characters. This can help prevent script injection and can protect against some buffer overrun exploits.

  • When you are working with XML documents, validate all data against its schema as it is entered.

  • In multitiered environments, all data should be validated before admission to the trusted zone. Data that does not pass the validation process should be rejected and an error should be returned to the previous tier.

  • Implement multiple layers of validation. Precautions you take against casually malicious users may be ineffective against determined attackers. A better practice is to validate input in the user interface and at all subsequent points where it crosses a trust boundary. For example, data validation in a client-side application can prevent simple script injection. However, if the next tier assumes that its input has already been validated, any malicious user who can bypass a client can have unrestricted access to a system.
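As a concrete illustration of the Parameters collection advice above, here is a minimal Python sketch using the pyodbc package; the connection string is illustrative, and only OrdersTable and custname come from the earlier example:

import pyodbc

# Illustrative connection string; adjust for your server and database
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;"
    "DATABASE=Orders;Trusted_Connection=yes"
)
cursor = conn.cursor()

custname = input("Customer name: ")  # untrusted user input

# The ? placeholder sends custname as a bound parameter value, never as SQL
# text, so input like "Robert'; drop table OrdersTable--" cannot terminate
# the string and append a new command
cursor.execute("SELECT * FROM OrdersTable WHERE custname = ?", custname)
for row in cursor.fetchall():
    print(row)

conn.close()

Because the value travels as a parameter, the server parses the statement once and treats the input purely as data.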







MS SQL Query Store usage scenarios

The Query Store feature provides insight into query plan choice and performance for SQL Server, Azure SQL Database, Azure SQL Managed Instance, and Azure Synapse Analytics. Query Store simplifies performance troubleshooting by helping you quickly find performance differences caused by query plan changes. It automatically captures a history of queries, plans, and runtime statistics, and retains these for your review. It separates data by time windows so you can see database usage patterns and understand when query plan changes happened on the server.


Query Store can be used in a wide set of scenarios where tracking and ensuring predictable workload performance is critical. Here are some examples you can consider:


  • Quickly find and fix a plan performance regression by forcing the previous query plan. Fix queries that have recently regressed in performance due to execution plan changes.

  • Determine the number of times a query was executed in a given time window, assisting a DBA in troubleshooting performance resource problems.

  • Identify top n queries (by execution time, memory consumption, etc.) in the past x hours (see the sketch after this list).

  • Audit the history of query plans for a given query.

  • Analyze the resource (CPU, I/O, and Memory) usage patterns for a particular database.

  • Identify top n queries that are waiting on resources.

  • Understand wait nature for a particular query or plan.
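For example, the "top n queries" scenario above can be answered from the documented Query Store catalog views (sys.query_store_query, sys.query_store_query_text, sys.query_store_plan, sys.query_store_runtime_stats). A minimal Python sketch with pyodbc, assuming an illustrative connection string:

import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;"
    "DATABASE=MyDb;Trusted_Connection=yes"
)
cursor = conn.cursor()

# Top 5 queries by average duration (avg_duration is reported in microseconds)
cursor.execute("""
    SELECT TOP 5
        q.query_id,
        t.query_sql_text,
        AVG(rs.avg_duration) AS avg_duration_us
    FROM sys.query_store_query AS q
    JOIN sys.query_store_query_text AS t
        ON q.query_text_id = t.query_text_id
    JOIN sys.query_store_plan AS p
        ON q.query_id = p.query_id
    JOIN sys.query_store_runtime_stats AS rs
        ON p.plan_id = rs.plan_id
    GROUP BY q.query_id, t.query_sql_text
    ORDER BY avg_duration_us DESC
""")
for row in cursor.fetchall():
    print(row.query_id, round(row.avg_duration_us), row.query_sql_text[:80])

conn.close()

Note that Query Store must be enabled on the database for these views to return data.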
