Data Engineering with Avishkar: Creating DataFrames in Apache Spark

March 16, 2024

Creating DataFrames in Apache Spark

Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. It offers high-level APIs in languages such as Scala, Java, Python, and R, making it accessible to a wide range of developers.

How to Install Apache Spark on Microsoft Windows 10

SparkSession, introduced in Spark 2.0, is the entry point for working with structured data in Apache Spark. It replaces the separate SQLContext and HiveContext with a single interface for SQL, DataFrame, and Dataset operations, and it exposes the underlying SparkContext through its sparkContext attribute.

Creating DataFrames:

SparkSession lets you create DataFrames from data sources such as JSON, CSV, Parquet, JDBC, and Avro (Avro support ships as the separate spark-avro module). Its read attribute returns a DataFrameReader for batch data, and readStream returns a DataStreamReader for streaming data.

Create DataFrames from JSON data sources using PySpark:

from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder \
    .appName("JSON Example") \
    .getOrCreate()

# Define the path to the JSON file
json_file_path = "d:/spark/examples/src/main/resources/people.json"

# Create a DataFrame from JSON
people_df = spark.read.json(json_file_path)

# Show the schema of the DataFrame
people_df.printSchema()

# Show the contents of the DataFrame
people_df.show()

>>> people_df.show()
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

>>> # Register the DataFrame as a SQL temporary view
>>> people_df.createOrReplaceTempView("people")
>>> sqlDF = spark.sql("SELECT * FROM people")
>>> sqlDF.show()
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

>>> # Create a DataFrame from a text file
>>> path = "d:/spark/examples/src/main/resources/people.txt"
>>> dftext = spark.read.text(path)
>>> dftext.show()
+-----------+
|      value|
+-----------+
|Michael, 29|
|   Andy, 30|
| Justin, 19|
+-----------+
