January 24, 2024

Data Harmonization: Building a Single Version of Truth


What is data harmonization?
Data harmonization is the process of standardizing and integrating data that arrives in different fields, formats, and dimensions.
Having access to clean, high-quality data allows you to analyze sales, marketing efforts, and other factors that contribute to your company’s success. 

The data harmonization process:

  • Delivers data in the way that you analyze it internally (internal language), as well as the way outside vendors and partners need it (external language)
  • Creates hierarchies that allow for big-picture views, and ensures that hierarchies are consistent across data sources
  • Provides enough granularity to make decisions, but not so much detail that it’s difficult to sort through the data

Data harmonization uses master data to align data within sources (e.g., standardizing product names within a sales database) as well as across sources (e.g., reconciling social media data that reports weeks as Sunday-Saturday with retail channel data that reports weekly sales as Monday-Sunday), as the sketch below illustrates.
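To make both kinds of alignment concrete, here is a minimal sketch in Python with pandas. The feed, the column names (date, product, units), and PRODUCT_NAME_MAP are illustrative assumptions, and it presumes both sources can supply daily-grain rows that can be re-bucketed into a common Monday-Sunday week:

Python

import pandas as pd

# Hypothetical master data: raw product names -> one standardized name
PRODUCT_NAME_MAP = {
    "ACME Widget 12oz": "Acme Widget 12 oz",
    "acme widget-12OZ": "Acme Widget 12 oz",
}

def harmonize(df):
    """Standardize names within a source, then re-bucket days into Monday-Sunday weeks."""
    out = df.copy()
    out["product"] = out["product"].replace(PRODUCT_NAME_MAP)  # within-source alignment
    # Cross-source alignment: "W-SUN" periods end on Sunday, so start_time
    # labels every row with the Monday that opens its Monday-Sunday week
    out["week"] = out["date"].dt.to_period("W-SUN").dt.start_time
    return out.groupby(["week", "product"], as_index=False)["units"].sum()

# Illustrative daily-grain feed; a retail feed would pass through the same function
social_df = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-07", "2024-01-08"]),  # a Sunday, then a Monday
    "product": ["ACME Widget 12oz", "acme widget-12OZ"],
    "units": [10, 4],
})
print(harmonize(social_df))  # two different weeks, one standardized product name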

How Does Data Harmonization Benefit Businesses?
In simple terms, data harmonization increases the value and utilization of data. It makes it possible for organizations to transform fragmented, inaccurate data into workable information, creating new analyses, insights, and visualizations, often without IT involvement. This helps users reduce the time needed to access business intelligence, discover key insights, and detect disruptions early, and it significantly lowers the overall cost of complex data analysis and of handling data in the long run. If an organization spends less time scrambling to find the right source of data, it can spend that time more effectively elsewhere, such as growing the business and making a significant revenue impact.

Whether an organization has been around for decades or is a recent start-up, it will inevitably gather a plethora of data. Along with it comes the distinct possibility that information gathered from a wide variety of sources will contain errors and misinformation. Beyond that, the sheer volume of information collected over a company's lifespan can be unwieldy and overwhelming.

With data harmonization tools, this data can become a valuable mine of insights and business intelligence. Organizations can learn about their customers, shifting market forces, and even their competitors. Companies across the globe are already mining and storing data to make smart business decisions and manage their customers. But first, to make sense of all that data, organizations need to harmonize it.

Most companies spend huge amounts of time and resources commissioning surveys, running focus group sessions, and gathering information from the internet, news channels, and social media networks. This information does not arrive as one manageable, cohesive body but as a mish-mash of raw data, and it must be harmonized before it can be understood as a whole. Raw, unharmonized data is not suitable for business analysis: it often contains irrelevant fields, misleading values, and duplicate records. When organizations apply data harmonization techniques, they can standardize that data and create a single source of verifiable information.



Who uses data harmonization?
Data harmonization technology is applicable in a variety of business functions, particularly sales and marketing. As a relatively new approach to data analysis and visualization, data harmonization is not yet widely used or understood.

Best practices in harmonization
Harmonization is typically a mix of automated steps (often using artificial intelligence) and manual effort, with leading vendors automating 60 percent or more of the process. The goal is to automate as much as possible to reduce errors and shorten the time to insight. Strong harmonization solutions:

  • Create data models that meet future plans as well as immediate needs
  • Offer deep industry and category expertise, which saves you time
  • Provide a no-code environment that lets data analysts harmonize the data directly


January 10, 2024

SQL Injection



SQL injection is an attack in which malicious code is inserted into strings that are later passed to an instance of SQL Server for parsing and execution. Any procedure that constructs SQL statements should be reviewed for injection vulnerabilities because SQL Server will execute all syntactically valid queries that it receives. Even parameterized data can be manipulated by a skilled and determined attacker.



How SQL Injection Works

The primary form of SQL injection consists of direct insertion of code into user-input variables that are concatenated with SQL commands and executed. A less direct attack injects malicious code into strings that are destined for storage in a table or as metadata. When the stored strings are subsequently concatenated into a dynamic SQL command, the malicious code is executed.

The injection process works by prematurely terminating a text string and appending a new command. Because the inserted command may have additional strings appended to it before it is executed, the attacker terminates the injected string with the comment mark "--". Subsequent text is ignored at execution time.



The following script shows a simple SQL injection. The script builds an SQL query by concatenating hard-coded strings together with a string entered by the user:


C#



// Unvalidated user input is concatenated directly into the SQL string.
var custname = Request.Form["custname"];
var sql = "SELECT * FROM OrdersTable WHERE custname = '" + custname + "'";

The user is prompted to enter a customer name. If the user enters "Robert", the query assembled by the script looks similar to the following:


SQL



SELECT * FROM OrdersTable WHERE custname = 'Robert'


However, assume that the user enters the following:


Robert'; drop table OrdersTable--  


In this case, the following query is assembled by the script:


SQL



SELECT * FROM OrdersTable WHERE custname = 'Robert'; drop table OrdersTable--'



The semicolon (;) denotes the end of one query and the start of another. The double hyphen (--) indicates that the rest of the current line is a comment and should be ignored. If the modified code is syntactically correct, it will be executed by the server: SQL Server first selects all records in OrdersTable where custname = 'Robert', and then drops OrdersTable.


How to prevent SQL Injection?


You must validate all user input and carefully review code that executes constructed SQL commands in the server that you are using. 

Follow coding best practices:


  • Validate the size, type, or content of the data that is received by your application

  • Never concatenate user input that is not validated. String concatenation is the primary point of entry for script injection.

  • Never build Transact-SQL statements directly from user input.

  • Stored procedures may be susceptible to SQL injection if they use unfiltered input. If you use stored procedures, you should use parameters as their input.

  • Use the Parameters collection with dynamic SQL, so that user input is passed as typed parameter values instead of being spliced into the statement text (see the sketch after this list).

  • If you are using a LIKE clause, wildcard characters still must be escaped.

  • Filtering input may also be helpful in protecting against SQL injection by removing escape characters.

  • How will your application behave if an errant or malicious user enters a 10-megabyte MPEG file where your application expects a postal code?

  • How will your application behave if a DROP TABLE or DELETE FROM statement is embedded in a text field?

  • Test the size and data type of input and enforce appropriate limits. This can help prevent deliberate buffer overruns.

  • Test the content of string variables and accept only expected values. Reject entries that contain binary data, escape sequences, and comment characters. This can help prevent script injection and can protect against some buffer overrun exploits.

  • When you are working with XML documents, validate all data against its schema as it is entered.

  • In multitiered environments, all data should be validated before admission to the trusted zone. Data that does not pass the validation process should be rejected and an error should be returned to the previous tier.

  • Implement multiple layers of validation. Precautions you take against casually malicious users may be ineffective against determined attackers. A better practice is to validate input in the user interface and at all subsequent points where it crosses a trust boundary. For example, data validation in a client-side application can prevent simple script injection. However, if the next tier assumes that its input has already been validated, any malicious user who can bypass a client can have unrestricted access to a system.
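As a minimal sketch of the parameter and LIKE guidance above, here is how the OrdersTable example could be queried safely from Python with the pyodbc driver; the connection string and the size limit are illustrative assumptions, not part of the original example:

Python

import pyodbc

# Illustrative connection string; adjust driver, server, and database to your environment
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=Sales;Trusted_Connection=yes;"
)
cursor = conn.cursor()

custname = input("Customer name: ")      # untrusted input
if len(custname) > 40:                   # enforce an expected size limit
    raise ValueError("Customer name too long")

# Parameterized query: the driver sends custname as a typed value, never as SQL text,
# so input such as "Robert'; drop table OrdersTable--" cannot close the string literal.
cursor.execute("SELECT * FROM OrdersTable WHERE custname = ?", custname)
print(cursor.fetchall())

# LIKE needs one extra step: escape wildcards inside the value before binding it,
# otherwise input such as "%" matches every row. Brackets are SQL Server's escape
# syntax; "[" is escaped first so the later substitutions are not double-escaped.
pattern = custname.replace("[", "[[]").replace("%", "[%]").replace("_", "[_]")
cursor.execute("SELECT * FROM OrdersTable WHERE custname LIKE ? + '%'", pattern)
print(cursor.fetchall())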







MS SQL Query Store usage scenarios

 The Query Store feature provides you with insight on query plan choice and performance for SQL Server, Azure SQL Database, Azure SQL Managed Instance, and Azure Synapse Analytics. The Query Store simplifies performance troubleshooting by helping you quickly find performance differences caused by query plan changes. Query Store automatically captures a history of queries, plans, and runtime statistics, and retains these for your review. It separates data by time windows so you can see database usage patterns and understand when query plan changes happened on the server. 
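Query Store must be enabled per database before any of this history is captured; a minimal sketch (the server and database names are illustrative), sent through Python's pyodbc driver for consistency with the other examples:

Python

import pyodbc

# autocommit because ALTER DATABASE cannot run inside a user transaction
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=master;Trusted_Connection=yes;",
    autocommit=True,
)
conn.execute("ALTER DATABASE Sales SET QUERY_STORE = ON;")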


Query Store can be used in a wide set of scenarios where tracking and ensuring predictable workload performance is critical. Here are some examples to consider:


  • Quickly find and fix queries that have recently regressed in performance due to execution plan changes, by forcing the previous query plan (see the sketch after this list).

  • Determine the number of times a query was executed in a given time window, assisting a DBA in troubleshooting performance resource problems.

  • Identify top n queries (by execution time, memory consumption, etc.) in the past x hours.

  • Audit the history of query plans for a given query.

  • Analyze the resource (CPU, I/O, and memory) usage patterns for a particular database.

  • Identify top n queries that are waiting on resources.

  • Understand the nature of waits for a particular query or plan.
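As a rough sketch of several of these scenarios, the following Python snippet (again via pyodbc) pulls the top queries by total CPU over the past two hours from the Query Store catalog views, then forces a previously good plan. The connection string and the query/plan IDs are placeholders you would read from the views themselves:

Python

import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=Sales;Trusted_Connection=yes;"
)
cursor = conn.cursor()

# Top 5 queries by total CPU in the past 2 hours, from the Query Store catalog views
top_cpu = """
SELECT TOP 5
       q.query_id,
       SUM(rs.count_executions)                   AS executions,
       SUM(rs.avg_cpu_time * rs.count_executions) AS total_cpu_us
FROM sys.query_store_query AS q
JOIN sys.query_store_plan AS p
  ON p.query_id = q.query_id
JOIN sys.query_store_runtime_stats AS rs
  ON rs.plan_id = p.plan_id
JOIN sys.query_store_runtime_stats_interval AS i
  ON i.runtime_stats_interval_id = rs.runtime_stats_interval_id
WHERE i.start_time >= DATEADD(HOUR, -2, SYSUTCDATETIME())
GROUP BY q.query_id
ORDER BY total_cpu_us DESC;
"""
for row in cursor.execute(top_cpu):
    print(row.query_id, row.executions, row.total_cpu_us)

# Force a previously good plan for a regressed query; 42 and 17 are placeholder
# IDs that would come from the catalog views queried above
cursor.execute("EXEC sp_query_store_force_plan @query_id = ?, @plan_id = ?", 42, 17)
conn.commit()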

Creating DataFrames from CSV in Apache Spark

The following script creates a SparkSession, the entry point for building DataFrames, and exposes its SparkContext:

Python

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CSV Example").getOrCreate()
sc = spark.sparkContext
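From there, a minimal sketch of the step the title describes, reading a CSV file into a DataFrame; data.csv and its header row are illustrative assumptions:

Python

# Read a CSV into a DataFrame; header=True uses the first row as column names,
# and inferSchema=True samples the data to choose column types
df = spark.read.csv("data.csv", header=True, inferSchema=True)

df.printSchema()  # inspect the inferred column names and types
df.show(5)        # preview the first five rows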