February 15, 2013

Scaling Out SQL Server

Scalability is the ability of an application to efficiently use more resources in order to do more useful work. For example, an application that can service four users on a single-processor system may be able to service 15 users on a four-processor system. In this case, the application is scalable. If adding more processors doesn't increase the number of users serviced (if the application is single threaded, for example), the application isn't scalable.

There are two kinds of scalability: scaleup and scaleout.
Scaleup means scaling to a bigger, more powerful server—going from a four-processor server to a 64-processor or 128-processor server, for example. This is the most common way for databases to scale. When your database runs out of resources on your current hardware, you go out and buy a bigger box with more processors and more memory. Scaleup has the advantage of not requiring significant changes to the database. In general, you just install your database on a bigger box and keep running the way you always have, with more database power to handle a heavier load.  

Scaleout means expanding to multiple servers rather than a single, bigger server. Scaleout usually has an initial hardware cost advantage: eight four-processor servers generally cost less than one 32-processor server. Scaling out means separating or partitioning the database system so that its parts can be placed on separate database servers, letting you spread processing power across as many servers as necessary to accommodate expanding growth. However, this flexibility comes at the price of additional complexity. A scaleout database scenario is not an easy one to design or administer, and you must answer many difficult business and technology-driven questions before you can successfully scale out a database system.

There is no rule of thumb for choosing between scaleup and scaleout. If the cost of one big server is less than the combined licensing and maintenance costs of many small ones, scaleup is better than scaleout. On the other hand, if one machine out of your N machines fails, it matters less: the system stays up and running, and the same goes for hardware, OS, and software updates and upgrades, which can be rolled out one server at a time. If availability through failures and upgrades is the priority, scaleout is better than scaleup.
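As an illustration, here is a minimal T-SQL sketch of one classic SQL Server scaleout technique, the distributed partitioned view; the server, database, table, and column names are hypothetical. Each member table lives on a different (linked) server and holds one range of rows, enforced by a CHECK constraint so the optimizer can route queries to the right member server.

-- On each member server, a table holding one range of customers,
-- with a CHECK constraint that defines the range.
-- (Run on Server1; Server2 holds CustomerID 100000 and above.)
CREATE TABLE dbo.Customers_1
(
    CustomerID INT PRIMARY KEY
        CHECK (CustomerID < 100000),
    CustomerName NVARCHAR(100) NOT NULL
);
GO

-- On the server the application connects to, a view that unions the
-- member tables across linked servers into one logical table.
CREATE VIEW dbo.Customers
AS
SELECT CustomerID, CustomerName FROM Server1.SalesDB.dbo.Customers_1
UNION ALL
SELECT CustomerID, CustomerName FROM Server2.SalesDB.dbo.Customers_2;
GO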

February 07, 2013

What is Big Data

Big Data is a massive collection of data produced by multiple traffic sources and constantly being updated. The very nature of Big Data means it is complex and almost impossible to get a handle on in the first place, let alone break down, assess, and turn into tangible results and recommendations that companies can learn from.
With Big Data, traditional web analytics is just the tip of the iceberg. We still need to know what traffic we're getting, where it's coming from, and which journeys customers take when they arrive on the site. But to run a successful eCommerce store, we also need to take into account and learn from other data that is out of our control and not necessarily ours to "own".

Big Data is a massive volume of both structured and unstructured data that is so large it is difficult to process using traditional database and software techniques.

The "structured" portion of Big Data refers to fixed fields within a database. For ecommerce merchants, this could be customer data — address, zip code — that's stored in a shopping cart.

The "unstructured" part encompasses email, video, tweets, and Facebook Likes. None of the unstructured data resides in a fixed database that's accessible to merchants.product reviews, social media data and images – things you know are out there and relate to your business but things you can’t necessarily get a hold of! But the feedback from, say, social media has become a very useful research tool for businesses.




August 18, 2012

Calling Stored Procedures from .NET Applications


In .NET, there are three very similar ways of accessing SQL Server.
      System.Data.Odbc
      System.Data.SqlClient
      System.Data.OleDb

Stored procedures provide more alternatives in the way they can pass data back to the application. They tend to be faster, are much more secure, are more economical with server memory, and can contain far more logic. They also make teamwork easier: as long as the stored procedure's name, behavior, and parameters stay the same, someone else on the team can work on the database code without you having to change your client software.
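As a quick illustration, here is a minimal C# sketch of calling a stored procedure through the System.Data.SqlClient provider listed above; the connection string and the procedure name usp_GetOrdersByCustomer with its @CustomerID parameter are hypothetical placeholders.

using System;
using System.Data;
using System.Data.SqlClient;

class StoredProcDemo
{
    static void Main()
    {
        // Placeholder connection string; point it at your own server.
        string connStr = "Server=.;Database=SalesDB;Integrated Security=true;";

        using (SqlConnection conn = new SqlConnection(connStr))
        using (SqlCommand cmd = new SqlCommand("dbo.usp_GetOrdersByCustomer", conn))
        {
            // The StoredProcedure command type makes ADO.NET invoke the
            // procedure via RPC instead of sending an ad-hoc T-SQL batch.
            cmd.CommandType = CommandType.StoredProcedure;

            // A typed parameter: the value is never concatenated into
            // the SQL text, so it cannot inject SQL.
            cmd.Parameters.Add("@CustomerID", SqlDbType.Int).Value = 42;

            conn.Open();
            using (SqlDataReader reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    Console.WriteLine(reader["OrderID"]);
                }
            }
        }
    }
}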

Security:
Stored procedures are often presented as a counter-measure to dangerous SQL injection attacks, a vulnerability that applications using embedded SQL are more exposed to.
In the Microsoft SQL Server environment, however, SQL injection attacks can be prevented by using parameters, with or without stored procedures. The claim that stored procedures are required for safety is a damaging bit of misinformation: programmers who cannot use stored procedures will leave their applications more vulnerable to attack than they should be because of it.
Additionally, using stored procedures lets you use the SqlParameter class available in ADO.NET to specify data types for stored procedure parameters. This gives you an easy way to validate the types of user-provided values as part of a defense-in-depth strategy. To be sure, parameters are just as useful in in-line queries as in stored procedures for narrowing the range of acceptable user input.
Stored procedures also allow for better data protection by controlling how the data is accessed. By granting a database login EXECUTE permissions only on stored procedures, you can limit the application to a specific set of actions.
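A minimal T-SQL sketch of that permission model follows (login, user, procedure, and table names are hypothetical): the application's login can run the procedure but has no rights on the underlying table.

-- Create a login and a matching database user for the application.
CREATE LOGIN app_login WITH PASSWORD = 'UseAStrongPasswordHere1!';
GO
USE SalesDB;
GO
CREATE USER app_user FOR LOGIN app_login;
GO

-- The user may execute the procedure...
GRANT EXECUTE ON dbo.usp_GetOrdersByCustomer TO app_user;
GO

-- ...and is explicitly denied direct access to the table it reads.
DENY SELECT, INSERT, UPDATE, DELETE ON dbo.Orders TO app_user;
GO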

Performance: 
When stored procedures are used, SQL Server can cache the ‘execution plan’ that it uses to execute the T-SQL, rather than having to recalculate the execution plan on each request.
The cached execution plan used to give stored procedures a performance advantage over queries. However, for the last couple of versions of SQL Server, execution plans are cached for all T-SQL batches, regardless of whether or not they are in a stored procedure. Therefore, performance based on this feature is no longer a selling point for stored procedures. Any T-SQL batch with static syntax that is submitted frequently enough to prevent its execution plan from aging out of memory will receive identical performance benefits. The "static" part is key; any change, even something as insignificant as the addition of a comment, will prevent matching with a cached plan and thereby prevent plan re-use.
However, stored procedures can still provide performance benefits where they can be used to reduce network traffic. You only have to send the EXECUTE stored_proc_name statement over the wire instead of a whole T-SQL routine, which can be pretty extensive for complex operations. A well-designed stored procedure can reduce many round trips between the client and the server to a single call.
Additionally, using stored procedures allows you to enhance execution plan re-use, and thereby improve performance, by using remote procedure calls (RPCs) to process the stored procedure on the server. When you use a SqlCommand.CommandType of StoredProcedure, the stored procedure is executed via RPC. The way RPC marshals parameters and calls the procedure on the server side makes it easier for the engine to find the matching execution plan and simply plug in the updated parameter values.
One last thing to think about when considering using stored procedures to enhance performance is whether you are leveraging T-SQL strengths. Think about what you want to do with the data.
Are you using set-based operations, or doing other operations that are strongly supported in T-SQL? Then stored procedures are an option, although in-line queries would also work.
Are you trying to do row-based operations, or complex string manipulation? Then you probably want to re-think doing this processing in T-SQL, unless you can use the Common Language Runtime (CLR) integration that has been available since SQL Server 2005.
It’s also worth mentioning that the easiest way to get performance out of your database is to do everything you can to take advantage of the platform you are running on.
Use the power of the database to separate the wheat from the chaff, and use your business logic to turn the wheat into dough. In many cases you can get better performance by looping and filtering data in SQL Server than by performing the same loops and filters in the data access layer; databases are intrinsically designed to do this, while you and I have to write our own code (which do you think is going to be faster?). It is, however, important to understand how SQL Server uses indexes and clustered indexes.

Beyond sorting and filtering data, stored procedures can also batch common work together or retrieve multiple result sets. For example: retrieve some data, update a datetime stamp, and then insert a new record. Executed once a second as ad-hoc SQL, these 3 tasks would produce 259,200 independent database requests per day, versus 86,400 per day if all three were encapsulated in a stored procedure. That's 172,800 database round trips, with their connection and network I/O costs, that you no longer require! Consolidating work through stored procedures makes more effective use of a connection (and of your system).
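A sketch of such a procedure, with hypothetical table and column names; the three statements that would otherwise be three round trips run as a single call:

CREATE PROCEDURE dbo.usp_ProcessHeartbeat
    @DeviceID INT
AS
BEGIN
    SET NOCOUNT ON;

    -- 1. Retrieve some data for the caller.
    SELECT Status, LastSeen
    FROM dbo.Devices
    WHERE DeviceID = @DeviceID;

    -- 2. Update a datetime stamp.
    UPDATE dbo.Devices
    SET LastSeen = GETDATE()
    WHERE DeviceID = @DeviceID;

    -- 3. Insert a new log record.
    INSERT INTO dbo.DeviceLog (DeviceID, LoggedAt)
    VALUES (@DeviceID, GETDATE());
END;
GO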

If you use parameterized queries instead of strictly ad-hoc SQL statements, the plan-caching performance benefit is the same as with stored procedures. But parameterized queries still suffer from a weaker security design: unless access is granted only via stored procedures, the login has direct access to the underlying tables and can do anything its table permissions allow.
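As a sketch of that contrast (assuming an open SqlConnection named conn, as in the earlier example, and the same hypothetical table names), a parameterized in-line query enjoys the same plan re-use, but the login running it needs rights on the table itself:

// Parameterized ad-hoc SQL: the plan for this exact statement text is
// cached and re-used, just like a stored procedure's plan.
using (SqlCommand cmd = new SqlCommand(
    "SELECT OrderID FROM dbo.Orders WHERE CustomerID = @CustomerID", conn))
{
    cmd.Parameters.Add("@CustomerID", SqlDbType.Int).Value = 42;
    using (SqlDataReader reader = cmd.ExecuteReader())
    {
        while (reader.Read())
        {
            // Unlike the stored procedure call, this requires the login
            // to have SELECT rights on dbo.Orders directly.
            Console.WriteLine(reader["OrderID"]);
        }
    }
}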

In short: the cached execution plan no longer gives stored procedures a performance advantage over parameterized queries, but stored procedures can still provide performance benefits where they reduce network traffic.

Maintenance: 
Another potential benefit to consider is maintainability. In a perfect world, your database schema would never change and your business rules would never get modified, but in the real world these things happen. That being the case, it may be easier to modify a stored procedure to include data from the new X, Y, and Z tables that were added to support a new sales initiative than to change that logic somewhere in your application code. Changing it in the stored procedure makes the update transparent to the application: you still return the same sales information, even though the internal implementation of the stored procedure has changed. Updating the stored procedure will usually take less time and effort than changing, testing, and re-deploying your assembly.
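For instance (a hypothetical sketch): the procedure below is altered to fold in a newly added RegionalSales table, but its name, parameter, and result columns are unchanged, so no client code has to change.

ALTER PROCEDURE dbo.usp_GetSalesSummary
    @Year INT
AS
BEGIN
    SET NOCOUNT ON;

    -- Same columns as before; rows from the new RegionalSales table
    -- are simply folded into the total.
    SELECT ProductID, SUM(Amount) AS TotalAmount
    FROM
    (
        SELECT ProductID, Amount FROM dbo.Sales WHERE SalesYear = @Year
        UNION ALL
        SELECT ProductID, Amount FROM dbo.RegionalSales WHERE SalesYear = @Year
    ) AS all_sales
    GROUP BY ProductID;
END;
GO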
Also, by abstracting the implementation and keeping this code in a stored procedure, any application that needs access to the data can get it in a uniform manner. You don't have to maintain the same code in multiple places, and your users get consistent information.
Another maintainability benefit of storing your T-SQL in stored procedures is better version control. You can version control the scripts that create and modify your stored procedures, just as you can any other source code module. By using Microsoft Visual SourceSafe® or some other source control tool, you can easily revert to or reference old versions of the stored procedures.
One caveat with using stored procedures to enhance maintainability is that they cannot insulate you from all possible changes to your schemas and rules. If the changes are large enough to require a change in the parameters fed into the stored procedure, or in the data returned by it, then you are still going to have to go in and update the code in your assembly to add parameters, update GetValue() calls, and so forth.
Another issue to consider is that using stored procedures to encapsulate business logic limits your application portability, in that it ties you to SQL Server. If application portability is critical in your environment, encapsulating business logic in a RDBMS-neutral middle tier may be a better choice.

Abstraction:
Stored procedures provide abstraction between the data and the business logic layer. The data model can be dramatically changed, but the stored procedures can still return identical data.

Disadvantages:
1. Stored procedure languages are quite often vendor-specific. Switching to another vendor's database most likely requires rewriting any existing stored procedures.
2. Stored procedure languages from different vendors have different levels of sophistication. For example, Oracle's PL/SQL has more language features and built-in packages (such as DBMS_ and UTL_) than Microsoft's T-SQL.
3. Tool support for writing and debugging stored procedures is often not as good as for other programming languages, though this differs between vendors and languages. For example, both PL/SQL and T-SQL have dedicated IDEs and debuggers, and PL/pgSQL can be debugged from various IDEs.

Conclusion:

When it comes to using stored procedures in a SQL Server-centric application, there is very little downside: applications can be built more securely, are easier to maintain, and typically perform better.


August 17, 2012

How to allow a SQL Login to see only ONE database


On one server, there are often many other databases. Sometimes databases are named after clients, so when one client logs in with their credentials, they can see the names of the other clients we are working with.
We do not want them to be able to see all the other databases on the instance. They have access to only one database, and that is the only one they should see in Object Explorer.

To implement this, we can use the following steps. For example, suppose we want to allow client1_login to see only the client1_db database.

USE [master]
GO

-- Make sure the login can view all databases for the moment.
GRANT VIEW ANY DATABASE TO client1_login
GO

USE client1_db
GO

-- Drop the user in the database if it already exists.
IF EXISTS (SELECT *
           FROM sys.database_principals
           WHERE name = N'client1_login')
  DROP USER client1_login
GO

-- Grant the login ownership of the database (ownership of the dbo schema).
ALTER AUTHORIZATION ON DATABASE::client1_db TO client1_login
GO

USE [master]
GO

-- Deny the ability to see other databases; as owner of client1_db,
-- the login will still see that one database.
DENY VIEW ANY DATABASE TO client1_login
GO

May 11, 2012


Features Comparison of BI Tools
User Experience:

MicroStrategy: The MicroStrategy Web user interface adheres to an "Extreme AJAX" model in which processing is shifted from the Web server to the Web browser, making for a more responsive Web interface that increases user productivity and improves user adoption.

QlikView: QlikView has multiple Web interfaces intended for different deployment requirements. Because each interface has different capabilities, developers are typically forced to make tradeoffs between functionality and deployment requirements.

Pentaho: The Pentaho web interface offers very limited functionality. It lacks familiar Microsoft paradigms, making the end user experience less intuitive. Enterprise reports created through the Pentaho Web interface are limited to basic reports without graphs, charts, or crosstabs, severely limiting the end user experience and self-service capabilities.


Performance:

MicroStrategy: A ROLAP architecture that leverages the database for much of its processing. Data joins and analytic calculations are pushed down to the database whenever possible, and MicroStrategy's multi-pass approach provides the flexibility to answer any analytical question in the most optimal manner.

QlikView: QlikView stores all data and performs all calculations in memory on the middle-tier server. It does not fully leverage the relational database or the hard disk on the middle tier; these aspects of the architecture result in inefficient resource utilization and limit QlikView's scalability.

Pentaho: The Pentaho ROLAP engine does not provide fully implemented multi-source ROLAP and multi-pass SQL engines. It is unable to leverage the database to the fullest extent possible, resulting in unnecessary network and hardware resource utilization.

Deployment and Administration:

MicroStrategy: Provides a platform that is quick to implement and deploy as well as easy to maintain and administer, fueled by a single code base that offers reusable business logic across the entire platform. MicroStrategy's single BI server provides efficient, centralized administration for IT and fewer moving parts, which translates into less downtime.

QlikView: QlikView lacks a common reusable metadata layer shared across documents. This creates a maintenance challenge, as developers are typically forced to continually and manually synchronize metric definitions and security profiles across documents.

Pentaho: Pentaho lacks a unified and reusable metadata layer, which creates a maintenance challenge and promotes "multiple versions of the truth." Its administration console controls only a subset of administrative tasks, and there are fewer tools to centrally monitor and manage BI applications, so more administrators are needed per number of end users. Pentaho also lacks enterprise features such as clustering and load balancing, increasing administration complexity and the IT workload.

Metadata Reusability:

MicroStrategy: Reusable metadata is easier to maintain and requires less redundancy; end users have more self-service capabilities that offload work from the IT staff; and a comprehensive suite of administrative tools requires fewer IT administrators.

QlikView: Developers are forced to create redundant metadata objects, as the metadata objects they create cannot be reused across multiple reports, causing unnecessary development and maintenance effort.

Pentaho: Developers are likewise forced to create redundant metadata objects, as the metadata objects they create cannot be reused across multiple reports, causing unnecessary development and maintenance effort.
