December 05, 2014

Pentaho Data Integration


Pentaho Data Integration (PDI, also called Kettle) is the Pentaho component responsible for Extract, Transform, and Load (ETL) processes. It can be used for the following purposes:
  • Populating data warehouses
  • Migrating data between applications or databases
  • Exporting data from databases to flat files
  • Bulk-loading data into databases
  • Data cleansing
  • Integrating applications

Spoon:
Spoon is the graphical tool with which you design and test every PDI process.
In Spoon, you build Jobs and Transformations. PDI offers two methods to save them: a database repository or plain files.
If you choose the repository method, the repository has to be created the first time you run Spoon. If you choose the files method, Jobs are saved in files with the .kjb extension, and Transformations in files with the .ktr extension.
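Both file types are XML documents. As a rough orientation, a Transformation file looks something like the simplified sketch below; this is illustrative only, since the actual .ktr files Spoon generates contain many more elements and attributes, and the step names here are made up for the example.

```xml
<!-- Simplified, illustrative sketch of a Transformation (.ktr) file.
     Real files written by Spoon contain many additional elements. -->
<transformation>
  <info>
    <name>sample_transformation</name>
  </info>
  <step>
    <name>Read input</name>
    <type>CsvInput</type>
    <!-- step-specific settings go here -->
  </step>
  <step>
    <name>Write output</name>
    <type>TableOutput</type>
  </step>
</transformation>
```

In practice you never write these files by hand; Spoon generates and maintains them as you design on the canvas.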
Starting Spoon
Start Spoon by executing spoon.bat on Windows, or spoon.sh on Unix-like operating systems. As soon as Spoon starts, a dialog window appears asking for the repository connection data. Click the No Repository button.
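Jobs and Transformations saved as files can also be run later without the GUI, using PDI's command-line tools: Pan (for Transformations) and Kitchen (for Jobs). A minimal sketch, with hypothetical file paths:

```
# Launch the Spoon GUI
./spoon.sh              # Unix-like systems
spoon.bat               # Windows

# Run a saved Transformation or Job headlessly
# (the file paths below are hypothetical examples)
./pan.sh -file=/home/user/my_transformation.ktr
./kitchen.sh -file=/home/user/my_job.kjb
```

This is what makes the files method convenient for scheduled, unattended ETL runs, for example from cron or the Windows Task Scheduler.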


