January 08, 2020

Amazon Athena


Overview of Athena
Amazon Athena is an interactive query service that developers and data analysts use to analyze data stored in Amazon S3. Athena’s serverless architecture lowers operational costs: users don’t need to provision, scale, or manage any servers.

Amazon Athena users analyze data with standard SQL. Because Athena is serverless, there is no infrastructure to oversee, and you pay only for the queries you run. You don’t even need to load your data into Athena; just point to your data in Amazon S3, define the schema, and begin querying.

To get started, just log in to the Athena console, define your schema, and start querying. Amazon Athena uses Presto with full standard SQL support and works with a variety of standard data formats, including CSV, JSON, ORC, Apache Parquet, and Avro. While Amazon Athena is ideal for quick, ad hoc querying and integrates with Amazon QuickSight for easy visualization, it can also handle complex analysis, including large joins, window functions, and arrays.
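Queries can also be submitted programmatically. Below is a minimal sketch using the boto3 Athena client; the database name (sales_db), table (orders), and results bucket (s3://my-athena-results/) are hypothetical placeholders, not part of this post.

    import time
    import boto3

    athena = boto3.client("athena")

    # Submit a standard SQL query; Athena writes the results to the S3 location you choose.
    response = athena.start_query_execution(
        QueryString="SELECT region, SUM(amount) AS total FROM orders GROUP BY region",
        QueryExecutionContext={"Database": "sales_db"},  # hypothetical database
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # hypothetical bucket
    )
    query_id = response["QueryExecutionId"]

    # Poll until the query finishes; there are no servers to manage, and you pay per query.
    while True:
        state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)

    # Print the result rows (the first row holds the column headers).
    if state == "SUCCEEDED":
        for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
            print([col.get("VarCharValue") for col in row["Data"]])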



Some Athena Facts
  • Athena supports only EXTERNAL tables; when you drop a table in Athena, only the table metadata is removed and the data remains in Amazon S3 (see the sketch after this list)
  • Athena uses an approach known as schema-on-read: the schema you define is applied to your data at query time, rather than when the data is loaded
  • Athena does not modify your data in Amazon S3
  • Athena uses Apache Hive to define tables and create databases, which are essentially logical namespaces of tables
  • Athena can only query the latest version of data on a versioned Amazon S3 bucket, and cannot query previous versions of the data
  • Athena does not support querying the data in the GLACIER storage class
  • Athena performs full table scans instead of using indexes
  • Athena is not ACID-compliant; it does not support transactional statements such as UPDATE or DELETE
  • Athena is case-insensitive and converts table names and column names to lower case
  • Athena table, view, database, and column names cannot contain special characters, other than underscore (_)
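To make the first two points above concrete, here is a minimal sketch of the schema-on-read workflow, with hypothetical table columns and S3 paths. The CREATE EXTERNAL TABLE statement only registers metadata over CSV files that already sit in S3; nothing is loaded or copied, and DROP TABLE later removes that metadata while the files stay in place.

    import boto3

    athena = boto3.client("athena")

    # Hive-style DDL: registers a schema over existing CSV files; no data is loaded.
    # 's3://my-data-bucket/orders/' is a hypothetical location of the CSV files.
    ddl = """
    CREATE EXTERNAL TABLE IF NOT EXISTS orders (
        order_id STRING,
        region   STRING,
        amount   DOUBLE
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 's3://my-data-bucket/orders/'
    """

    athena.start_query_execution(
        QueryString=ddl,
        QueryExecutionContext={"Database": "sales_db"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )

    # Dropping the table removes only the table metadata; the CSV files remain in S3.
    athena.start_query_execution(
        QueryString="DROP TABLE orders",
        QueryExecutionContext={"Database": "sales_db"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )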

