Data Engineering with Avishkar: 01/01/2016

January 27, 2016

Amazon RDS - Backup and Restore

Amazon RDS - Backup

FAQ: Is it possible to create a backup of a database running on an Amazon RDS instance and restore it on a local machine?

Ans:

1. You can’t currently create a .bak file directly from Amazon RDS.

2. Use the SQL Azure Migration Wizard (SQLAzureMW) with Amazon RDS to copy the RDS database to an EC2 instance or a local server.

Once that is done, you can create a .bak file from the SQL Server running on the EC2 instance. If you have the bandwidth, or your database is small, you may be able to use the migration tool directly on your target machine.

You can also create a local copy of the data from AWS RDS. In SQL Server Management Studio, right-click your database > Tasks > Export Data.

Import and Export Wizard or Bulk Copy Program (BCP) for SQL Server

1) Use the SQLAzureMW tool.

2) For databases of 1 GB or larger, it is more efficient to script only the database schema and then use the Import and Export Wizard or the bulk copy feature of SQL Server to transfer the data.
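As a rough sketch of the bulk copy route, the helper below only builds the argument list for a `bcp` call, one for exporting a table from the source server and one for loading it into the target. The table name, server endpoint, database name, and logins are hypothetical placeholders, and the password is deliberately left out so that `bcp` prompts for it.

```python
from typing import List


def bcp_command(table: str, data_file: str, server: str,
                database: str, user: str, direction: str = "out") -> List[str]:
    """Build the argument list for one bcp bulk-copy call.

    direction is "out" to export from the source server and "in" to load
    into the target. No password is passed, so bcp prompts for it.
    """
    if direction not in ("in", "out"):
        raise ValueError("direction must be 'in' or 'out'")
    return [
        "bcp", table, direction, data_file,
        "-S", server,    # server to connect to
        "-d", database,  # database that holds the table
        "-U", user,      # SQL Server login
        "-n",            # native data format
    ]


# Hypothetical table, endpoint and logins, for illustration only.
export_cmd = bcp_command("dbo.Cars", "cars.dat",
                         "mydb.example.rds.amazonaws.com", "Garage", "admin")
import_cmd = bcp_command("dbo.Cars", "cars.dat",
                         "localhost", "Garage", "sa", direction="in")

print(" ".join(export_cmd))
print(" ".join(import_cmd))
```

The schema still has to exist on the target before the import runs, which is why the schema is scripted first and the data is moved afterwards.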

Backup using SQLAzureMW Tool

BCP Script which is automatically generated.

Result Summary

Amazon RDS - Restore

Apply script to target server.

Tables migrated.

Interesting Facts About Database Indexes

Create Clustered Indexes on the tables’ Natural Keys. Natural Keys are the fields that best identify the row’s data. For example, the Primary Key used for a table might be a Unique Identifier or a Big Int, but the data might be accessed through a combination of columns. For a car these columns could be Year, Make, Model and VIN. When you create the Clustered Index, order the columns by their selectivity and use this same order when you query for the information.
Create Non-Clustered Indexes on your Primary Keys.
Create Non-Clustered Indexes for all Foreign Keys.
Create Non-Clustered Indexes for columns that are used in WHERE, ORDER BY, MERGE, JOIN and other clauses that require matching data.
Create Filtered INDEXES to build highly selective sets of keys for columns that would otherwise have poor selectivity.
Use Covering INDEXES to reduce the number of bookmark lookups required to gather data that is not present in the other INDEXES.
Covering INDEXES can be used to physically maintain the data in the same order as is required by the queries’ result sets, reducing the need for SORT operations.
Covering INDEXES have an increased maintenance cost, so check that the performance gain justifies the extra maintenance cost.
If a CLUSTERED INDEX is present on the table, then NONCLUSTERED INDEXES will use its key instead of the table ROW ID.
To reduce the size consumed by the NONCLUSTERED INDEXES it’s imperative that the CLUSTERED INDEX KEY is kept as narrow as possible.
Physical reorganization of the CLUSTERED INDEX does not physically reorder NONCLUSTERED INDEXES.
SQL Database can JOIN and INTERSECT INDEXES in order to satisfy a query without having to read data directly from the table.
Favor many narrow NONCLUSTERED INDEXES that can be combined or used independently over wide INDEXES that can be hard to maintain.
NONCLUSTERED INDEXES can reduce blocking by having SQL Database read from NONCLUSTERED INDEX data pages instead of the actual tables.
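Some of these points can be observed in a small experiment. The sketch below uses SQLite through Python's built-in `sqlite3` module as a stand-in for SQL Server, so the syntax and planner are not the same: SQLite's partial index plays the role of a filtered index here, and the `cars` table and index names are made up for illustration. It asks the planner how it would run three queries and prints the answers.

```python
import sqlite3


def query_plan(conn, sql):
    """Return the detail column of EXPLAIN QUERY PLAN for a statement."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cars (
        id    INTEGER PRIMARY KEY,
        year  INTEGER,
        make  TEXT,
        model TEXT,
        vin   TEXT,
        sold  INTEGER
    );
    -- Narrow index on two columns used for lookups.
    CREATE INDEX ix_cars_make_model ON cars (make, model);
    -- Partial index: SQLite's counterpart to a filtered index.
    CREATE INDEX ix_cars_unsold_year ON cars (year) WHERE sold = 0;
""")

# Every selected column is in the index, so the table is never read.
covered = query_plan(
    conn, "SELECT make, model FROM cars WHERE make = 'Audi'")

# vin is not in the index, so each match needs a lookup into the table.
lookup = query_plan(
    conn, "SELECT make, model, vin FROM cars WHERE make = 'Audi'")

# The WHERE clause repeats the index filter, so the partial index applies.
filtered = query_plan(
    conn, "SELECT year FROM cars WHERE sold = 0 AND year = 2015")

for name, plan in (("covered", covered), ("lookup", lookup),
                   ("filtered", filtered)):
    print(name, plan)
```

The exact wording of the plan text varies between SQLite versions, but the first query should report a covering index while the second reports a plain index search on the same index.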

January 21, 2016

Business Intelligence - Comparison Matrix

BI Product	Tableau	Qlik - QlikView	SAS	Microstrategy
Company	Tableau Software	QlikTech	SAS Institute	Microstrategy
Product URL	http://www.tableau.com/	http://www.qlik.com/	https://www.sas.com/	www.microstrategy.com
Segment	Leaders	Leaders	Leaders	Leaders
STRENGTHS	Users can leverage the power of self-service analytics with cool visualizations, dragging and dropping objects, measures and dimensions.	Scripting has the ability to clean data and support strong data modelling. Patented associative technology.		Schema & SQL Engine uniqueness
WEAKNESSES	Data integration can be quite complex and without a clean data source the software isn't much use.	GUI is still not as good as Tableau or Qlik Sense		Development experience is weak in Desktop & Web, and many options are missing in Web.
Deployment Platforms	Windows	Windows	Windows,Linux	Windows,Linux
Self-service tools	Yes	Yes	Yes	Yes
Mobility	Yes	Yes	Yes	Yes

January 13, 2016

Data Warehouse in 2016

In 2016, data warehouse vendors will differentiate themselves through innovations and feature enhancements in the following areas:

Integration with in-memory architectures to enable real-time analytics
Integration with Hadoop to support larger ingestion and transformation
Leveraging native data compression capabilities to secure sensitive data
Ability to simplify integration via data virtualization
Enabling in-database analytics to support sophisticated requirements
Analytic data platforms: real-time, ready-to-use tools such as native SQL, integration with the R programming language, and data mining algorithms
Modern data types: mobile devices, social media traffic, networked sensors (i.e. the Internet of Things)

Data Warehouse - Comparison Matrix

DW Product	IBM - DB2 Data Warehouse	Oracle - Exadata	Teradata
Company	IBM has four major businesses: hardware, software, services, and financing. Data warehousing is part of the data management business, which is part of the software business.	Oracle has three businesses: database, applications, and consulting. Its database business is the largest by far and currently represents 80 percent of Oracle’s new license revenues.	NCR has four businesses: data warehousing, financial self-service, retail store automation, and customer services. The Teradata Division is responsible for the data warehousing business
Product URL	www.ibm.com/DataWarehousing	https://www.oracle.com/database/data-warehouse/index.html	www.teradata.com/
STRENGTHS	- Rich and flexible data partitioning capabilities - Strong analytic functionality in OLAP and data mining - Market presence - Data models - Hardware bundle - Partner network - Strong services arm (IBM GSA)	- Intelligent Storage Grid - Hybrid Columnar Compression - Smart Flash Cache	- Massively parallel, partitioned, shared-nothing database server architecture - Simple and highly automated physical data warehouse implementation - Set of indexing approaches that enable fast access to data - Scalable hybrid-storage capabilities - Partnerships with all enterprise Hadoop distribution providers, and support for new analytic workloads on Teradata systems (JSON, geospatial, 3D geospatial and others)
WEAKNESSES	- Complex physical implementation - Lack of integration with multidimensional - High cost of ownership - DB2 on the open systems platforms continues to suffer from locking problems - Closed systems; can’t easily ride cost curves associated with commodity hardware - Expensive fault-tolerant solution compared with Exadata		- Incomplete visual tools for build and manage functionality - Proprietary hardware - Costly to maintain and upgrade - Limited skilled implementation expertise
Deployment Platforms	IBM AIX Microsoft Windows Linux Sun Solaris	IBM AIX Hewlett-Packard HP-UX Linux Microsoft Windows Sun Solaris	NCR SVR4 UNIX MP-RAS Microsoft Windows
Server Architecture	Server platform with a single processor Single database partition on a server platform with multiple processors Multiple partition configurations • Shared-nothing • Multiple server platforms • Server interconnect • Any number of readers and writers	Single server platform Distributed database Real Application Cluster (RAC) • Shared, partitioned data • Multiple server platforms • Server interconnect • Any number of readers and writers	Single and multiple node organization where a node is a hardware and software platform specialized and dedicated to data warehousing. Teradata Warehouse is a shared-nothing architecture in both its single and multiple node configurations
Data Type Support	SQL types: • Numeric • Binary • Character • Date time DATALINK XML Large objects (max 2 gigabytes) User-defined types (distinct: renamed SQL types; structured: object oriented; reference: hierarchies of built-in types)	Oracle built-in data types (SQL types): • Numeric • Binary • Character • Date time Large objects (max 2 gigabytes) User-defined types (object-oriented types, object identifier types, arrays, nested tables) Oracle-supplied types • Spatial • Media • Text • XML	SQL types: • Numeric • Binary • Character • Date time Large objects
Physical Design Recommendation	Neutral on the physical design of data warehouses.	Neutral on the physical design of data warehouses.	Teradata is neutral on the physical design of data warehouses but recommends a physical design of third normal form for data warehouses to maximize flexibility. Teradata further recommends that denormalized structures be implemented as views or redundant structures (logical data marts or special purpose tables).
Physical Implementation	Manual	Manual Template-based via templates and Database Configuration Assistant (DBCA) tool Automated via Oracle managed files	Automated
Custom Transformations	May be written in: SQL Java C++	May be written in: SQL PL/SQL	May be written in: SQL C++
Summary Table Support	Materialized query tables automate the creation and management of summary tables. A materialized query table stores the results of a query in a table	Materialized views automate the creation and management of summary tables. A materialized view stores the results of a query in a table	The OLAP transformations of Teradata Warehouse Miner can create and manage summary tables.
SQL Extensions	CUBE and ROLLUP in SELECT Functions • Aggregate • Numeric • Statistical • Correlation • Random number generation • Regression • Date time User-defined	CUBE and ROLLUP in SELECT Functions • Ranking • Window aggregate • Reporting aggregate • Lag/lead • Linear regression • Inverse percentile • Hypothetical rank and distribution • First/last • Numeric • Date time User-defined	QUALIFY, SAMPLE, and WITH in SELECT Functions and operators • Aggregate • Numeric • Date time • OLAP
OLAP	DB2 provides OLAP build and manage capabilities, relational OLAP on DB2 tables, and multidimensional and hybrid OLAP on a combination of DB2 tables and external multidimensional structures. DB2 OLAP Server is a separately priced and packaged product that is an external, but tightly integrated, multidimensional OLAP facility that IBM OEMs from Hyperion Solutions.	Oracle OLAP is a separately packaged and priced product that provides OLAP functionality	Provides relational OLAP on Teradata Warehouse tables
Data Mining	DB2 Intelligent Miner is bundled with DB2 Data Warehouse Enterprise Edition.	Oracle Data Mining is a separately priced and packaged product.	Teradata Warehouse Miner is a separately packaged and priced product that is tightly integrated with Teradata Warehouse.
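The Summary Table Support row describes storing the result of a query in a table. As a minimal illustration of that idea only, the sketch below hand-builds a summary table in SQLite through Python's `sqlite3` module; it is not how DB2 materialized query tables or Oracle materialized views are implemented, and the `sales` table, its rows, and the `refresh_summary` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount INTEGER);
    INSERT INTO sales VALUES ('East', 100), ('East', 50), ('West', 70);
    -- The summary table stores the result of the aggregate query.
    CREATE TABLE sales_summary AS
        SELECT region, SUM(amount) AS total FROM sales GROUP BY region;
""")


def refresh_summary(conn):
    """Rebuild the summary table from the base table (a full refresh)."""
    with conn:
        conn.execute("DELETE FROM sales_summary")
        conn.execute(
            "INSERT INTO sales_summary "
            "SELECT region, SUM(amount) FROM sales GROUP BY region"
        )


def read_summary(conn):
    """Return the summary rows ordered by region."""
    return conn.execute(
        "SELECT region, total FROM sales_summary ORDER BY region"
    ).fetchall()


before = read_summary(conn)

# New base data does not reach the summary until it is refreshed.
conn.execute("INSERT INTO sales VALUES ('West', 30)")
stale = read_summary(conn)

refresh_summary(conn)
after = read_summary(conn)

print(before, stale, after)
```

The gap between `stale` and `after` is the work that the products in the matrix automate: keeping the stored result in step with the base tables.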

Data Warehouse - Teradata

Company - DW Product	Teradata
Company	NCR has four businesses: data warehousing, financial self-service, retail store automation, and customer services. The Teradata Division is responsible for the data warehousing business
Product URL	www.teradata.com/
STRENGTHS	- Massively parallel, partitioned, shared-nothing database server architecture - Simple and highly automated physical data warehouse implementation - Set of indexing approaches that enable fast access to data - Scalable hybrid-storage capabilities - Partnerships with all enterprise Hadoop distribution providers, and support for new analytic workloads on Teradata systems (JSON, geospatial, 3D geospatial and others)
WEAKNESSES	- Incomplete visual tools for build and manage functionality - Proprietary hardware - Costly to maintain and upgrade - Limited skilled implementation expertise
Deployment Platforms	NCR SVR4 UNIX MP-RAS Microsoft Windows
Server Architecture	Single and multiple node organization where a node is a hardware and software platform specialized and dedicated to data warehousing. Teradata Warehouse is a shared-nothing architecture in both its single and multiple node configurations
Data Type Support	SQL types: • Numeric • Binary • Character • Date time Large objects
Physical Design Recommendation	Teradata is neutral on the physical design of data warehouses but recommends a physical design of third normal form for data warehouses to maximize flexibility. Teradata further recommends that denormalized structures be implemented as views or redundant structures (logical data marts or special purpose tables).
Physical Implementation	Automated
Custom Transformations	May be written in: SQL C++
Summary Table Support	The OLAP transformations of Teradata Warehouse Miner can create and manage summary tables.
SQL Extensions	QUALIFY, SAMPLE, and WITH in SELECT Functions and operators • Aggregate • Numeric • Date time • OLAP
OLAP	Provides relational OLAP on Teradata Warehouse tables
Data Mining	Teradata Warehouse Miner is a separately packaged and priced product that is tightly integrated with Teradata Warehouse.

Oracle Data Warehouse - Exadata

Company - DW Product	Oracle - Exadata
Company	Oracle has three businesses: database, applications, and consulting. Its database business is the largest by far and currently represents 80 percent of Oracle’s new license revenues.
Product URL	https://www.oracle.com/database/data-warehouse/index.html
STRENGTHS	- Intelligent Storage Grid - Hybrid Columnar Compression - Smart Flash Cache
WEAKNESSES
Deployment Platforms	IBM AIX Hewlett-Packard HP-UX Linux Microsoft Windows Sun Solaris
Server Architecture	Single server platform Distributed database Real Application Cluster (RAC) • Shared, partitioned data • Multiple server platforms • Server interconnect • Any number of readers and writers
Data Type Support	Oracle built-in data types (SQL types): • Numeric • Binary • Character • Date time Large objects (max 2 gigabytes) User-defined types (object-oriented types, object identifier types, arrays, nested tables) Oracle-supplied types • Spatial • Media • Text • XML
Physical Design Recommendation	Neutral on the physical design of data warehouses.
Physical Implementation	Manual Template-based via templates and Database Configuration Assistant (DBCA) tool Automated via Oracle managed files
Custom Transformations	May be written in: SQL PL/SQL
Summary Table Support	Materialized views automate the creation and management of summary tables. A materialized view stores the results of a query in a table
SQL Extensions	CUBE and ROLLUP in SELECT Functions • Ranking • Window aggregate • Reporting aggregate • Lag/lead • Linear regression • Inverse percentile • Hypothetical rank and distribution • First/last • Numeric • Date time User-defined
OLAP	Oracle OLAP is a separately packaged and priced product that provides OLAP functionality
Data Mining	Oracle Data Mining is a separately priced and packaged product.

IBM - DB2 Data Warehouse

Company - DW Product	IBM - DB2 Data Warehouse
Company	IBM has four major businesses: hardware, software, services, and financing. Data warehousing is part of the data management business, which is part of the software business.
Product URL	www.ibm.com/DataWarehousing
STRENGTHS	- Rich and flexible data partitioning capabilities - Strong analytic functionality in OLAP and data mining - Market presence - Data models - Hardware bundle - Partner network - Strong services arm (IBM GSA)
WEAKNESSES	- Complex physical implementation - Lack of integration with multidimensional - High cost of ownership - DB2 on the open systems platforms continues to suffer from locking problems - Closed systems; can’t easily ride cost curves associated with commodity hardware - Expensive fault-tolerant solution compared with Exadata
Deployment Platforms	IBM AIX Microsoft Windows Linux Sun Solaris
Server Architecture	Server platform with a single processor Single database partition on a server platform with multiple processors Multiple partition configurations • Shared-nothing • Multiple server platforms • Server interconnect • Any number of readers and writers
Data Type Support	SQL types: • Numeric • Binary • Character • Date time DATALINK XML Large objects (max 2 gigabytes) User-defined types (distinct: renamed SQL types; structured: object oriented; reference: hierarchies of built-in types)
Physical Design Recommendation	Neutral on the physical design of data warehouses.
Physical Implementation	Manual
Custom Transformations	May be written in: SQL Java C++
Summary Table Support	Materialized query tables automate the creation and management of summary tables. A materialized query table stores the results of a query in a table
SQL Extensions	CUBE and ROLLUP in SELECT Functions • Aggregate • Numeric • Statistical • Correlation • Random number generation • Regression • Date time User-defined
OLAP	DB2 provides OLAP build and manage capabilities, relational OLAP on DB2 tables, and multidimensional and hybrid OLAP on a combination of DB2 tables and external multidimensional structures. DB2 OLAP Server is a separately priced and packaged product that is an external, but tightly integrated, multidimensional OLAP facility that IBM OEMs from Hyperion Solutions.
Data Mining	DB2 Intelligent Miner is bundled with DB2 Data Warehouse Enterprise Edition.
