Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. It offers high-level APIs in languages such as Scala, Java, Python, and R, making it accessible to a wide range of developers.
In Apache Spark, SparkSession is the entry point for working with structured data, introduced in Spark 2.0. It unifies the functionality previously split across SQLContext and HiveContext, wraps the underlying SparkContext (still reachable as spark.sparkContext), and provides a single interface for SQL, DataFrame, and Dataset operations.
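As a minimal sketch, a session is built through the builder pattern; the application name and config key below are illustrative:

from pyspark.sql import SparkSession

# Build a new SparkSession, or return the existing one in this process
spark = SparkSession.builder \
    .appName("MyApp") \
    .config("spark.sql.shuffle.partitions", "8") \
    .getOrCreate()

# The underlying SparkContext remains accessible
sc = spark.sparkContext
print(spark.version)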
Creating DataFrames:
SparkSession lets you create DataFrames from a variety of data sources, including JSON, CSV, Parquet, JDBC, and Avro. It exposes spark.read for batch reads and spark.readStream for streaming reads, each returning a reader you can configure before loading the data.
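For instance, reading CSV and Parquet sources follows the same pattern; the paths and options here are illustrative:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Readers Example").getOrCreate()

# CSV: header handling and schema inference are opt-in options
csv_df = spark.read.option("header", "true") \
    .option("inferSchema", "true") \
    .csv("d:/spark/data/people.csv")

# Parquet: the schema is stored in the file itself, so no options are needed
parquet_df = spark.read.parquet("d:/spark/data/people.parquet")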
Creating a DataFrame from a JSON data source using PySpark:
from pyspark.sql import SparkSession
# Create a SparkSession
spark = SparkSession.builder \
    .appName("JSON Example") \
    .getOrCreate()
# Define the path to the JSON file
json_file_path = "d:/spark/examples/src/main/resources/people.json"
# Create a DataFrame from JSON
people_df = spark.read.json(json_file_path)
# Show the schema of the DataFrame
people_df.printSchema()
# Show the contents of the DataFrame
people_df.show()
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
# Register the DataFrame as a SQL temporary view
people_df.createOrReplaceTempView("people")
sqlDF = spark.sql("SELECT * FROM people")
sqlDF.show()
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
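The same result can be obtained without SQL by using the DataFrame API directly; as a sketch, a filtered variant of the query above:

from pyspark.sql.functions import col

# Equivalent of SELECT name, age FROM people WHERE age > 20
adults_df = people_df.select("name", "age").filter(col("age") > 20)
adults_df.show()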
# Create a DataFrame from a text file
path = "d:/spark/examples/src/main/resources/people.txt"
dftext = spark.read.text(path)
dftext.show()
+-----------+
| value|
+-----------+
|Michael, 29|
| Andy, 30|
| Justin, 19|
+-----------+
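Because read.text produces a single value column, the lines usually need parsing into real columns; a sketch using split, assuming the "Name, age" layout shown above:

from pyspark.sql.functions import split, trim, col

# Split each "Name, age" line on the comma and cast the age to an integer;
# the delimiter is an assumption based on the sample output above
name_col = split(col("value"), ",").getItem(0)
age_col = trim(split(col("value"), ",").getItem(1)).cast("int")
parsed_df = dftext.select(name_col.alias("name"), age_col.alias("age"))
parsed_df.show()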