Data Engineering with Avishkar: Creating DataFrames from CSV in Apache Spark

March 28, 2024

Creating DataFrames from CSV in Apache Spark

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CSV Example").getOrCreate()

sc = spark.sparkContext

Spark SQL provides spark.read.csv("file_name") to read a CSV file, or a directory of CSV files, into a Spark DataFrame, and dataframe.write.csv("path") to write a DataFrame out in CSV format. In PySpark, read and write are attributes rather than method calls, so they take no parentheses. The option() method customizes reading or writing, for example whether the first line is a header, which delimiter character separates fields, and which character set is used.

# path points to the CSV dataset.

# It can be either a single CSV file or a directory of CSV files.

path = "D:/spark/data/csv/sales.csv"

df = spark.read.csv(path)

df.show()

# Read a CSV with an explicit delimiter, using the first line as the column names

df_header = spark.read.option("delimiter", ",").option("header", True).csv(path)

df_header.show()
