March 07, 2024

How to Install Apache Spark on Microsoft Windows 10




Apache Spark is an open-source big data processing framework for handling large volumes of data from multiple sources. Spark is used in distributed computing for machine learning applications, data analytics, and graph-parallel processing, on single-node machines or on clusters. 

This blog post will show you how to install Apache Spark on Windows 10 and test the installation.

Step 1: Install Java 8

1.1 Download Java 8 from https://java.com/en/download/.

1.2 Install Java

1.3 Set the JAVA_HOME environment variable to the Java JDK directory (for example, C:\Program Files\Java\<jdk_version>).
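One way to set this is with setx from Command Prompt (a sketch; the JDK path below is the same placeholder used above, so substitute your actual install directory):

```shell
:: Set JAVA_HOME for the current user.
:: The path is an example placeholder -- use your actual JDK directory.
setx JAVA_HOME "C:\Program Files\Java\<jdk_version>"
```

Note that setx only affects new Command Prompt windows, not the one that is already open. You can also set the variable through the Environment Variables dialog in System Properties.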




1.4 Check the Java version from Command Prompt:

java -version



Step 2: Install Python

2.1 Download Python 3.11 from https://www.python.org/.

2.2 Install Python 3.11.

2.3 Check the Python version from Command Prompt:

python --version



Step 3: Configure Hadoop

3.1 Download the winutils.exe file from https://github.com/cdarlint/winutils.

3.2 Create the folder C:\Hadoop\bin.

3.3 Copy winutils.exe to C:\Hadoop\bin.

3.4 Set the HADOOP_HOME environment variable to C:\Hadoop.




3.5 Add %HADOOP_HOME%\bin to the Path environment variable.
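Both of these can be set from Command Prompt with setx (a sketch; editing them through the Environment Variables dialog works just as well):

```shell
:: Set HADOOP_HOME and add its bin folder to the user Path.
:: Note: setx stores the expanded value, so use the Environment Variables
:: dialog instead if you want %HADOOP_HOME% kept as a variable reference.
setx HADOOP_HOME "C:\Hadoop"
setx Path "%Path%;C:\Hadoop\bin"
```

Open a new Command Prompt afterwards so the changes take effect.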




Step 4: Install Spark

4.1 Download a pre-built Spark package from https://spark.apache.org/downloads.html.

4.2 Create a new folder named C:\Spark.

4.3 Extract the downloaded Spark archive to C:\Spark.

4.4 Set the SPARK_HOME environment variable to the extracted Spark directory (for example, C:\Spark\spark-3.5.0-bin-hadoop3).

4.5 Add %SPARK_HOME%\bin to the Path environment variable.
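As with Hadoop, these can be set with setx from Command Prompt (a sketch; the folder name matches the example above, so adjust it for the Spark version you downloaded):

```shell
:: Set SPARK_HOME and add its bin folder to the user Path.
:: The directory name is an example -- match it to your extracted Spark folder.
setx SPARK_HOME "C:\Spark\spark-3.5.0-bin-hadoop3"
setx Path "%Path%;C:\Spark\spark-3.5.0-bin-hadoop3\bin"
```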


Step 5: Launch Spark with Command Prompt

5.1 Open Command Prompt and run the spark-shell script:

C:\Spark\spark-3.5.0-bin-hadoop3\bin\spark-shell






5.2 Browse to http://localhost:4040/.

You should see the Apache Spark shell Web UI. 
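As an extra check, you can run one of the example jobs that ships with Spark (a sketch; the exact jar name under %SPARK_HOME%\examples\jars depends on the Spark and Scala versions you downloaded, so adjust it to match your installation):

```shell
:: Run the bundled SparkPi example to confirm spark-submit works.
:: Adjust the jar file name to match your Spark download.
spark-submit --class org.apache.spark.examples.SparkPi %SPARK_HOME%\examples\jars\spark-examples_2.12-3.5.0.jar 10
```

Near the end of the job's output you should see a line like "Pi is roughly 3.14...".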



Creating DataFrames and Datasets in Apache Spark

https://avishkarm.blogspot.com/2024/03/creating-dataframes-and-datasets-in.html
