August 28, 2010

Index guidelines

Clustered index guidelines:

Good when queries select large number of adjacent rows (range queries)
Create on the frequently used columns (in JOINs and WHERE with “=“,

“<“, “>“, “BETWEEN”)
If number of returned rows is small – non-clustered index may be as

efficient
Preferred on narrow and highly selective columns..

Remember cost of maintenance:
Updates reorganize the table
Performance impact
Causes index fragmentation over time

Non-clustered index guidelines:

Create for frequent search columns
Use on narrow and highly selective columns
Place on foreign key constraints (for join queries)
Check the workload for “covering” queries
Consider adding included columns

The drawback: maintenance cost
Frequent updates will ruin perf where there are too many indexes
Evaluate benefits of [not] indexing small tables

Creating DataFrames from CSV in Apache Spark

 from pyspark.sql import SparkSession spark = SparkSession.builder.appName("CSV Example").getOrCreate() sc = spark.sparkContext Sp...