October 08, 2014

Data Compression in QlikView

QlikView uses an associative in-memory technology to allow users to analyze and process data very quickly. Unique entries are only stored once in-memory thus removing repetition in the data: everything else are pointers to the parent data. That’s why QlikView is faster and stores more data in memory than traditional cubes.
 e.g. if you have a million rows of data but a field contains only 100 unique values, then QlikView stores those 100 values only and not a million values. This applies equally to the RAM QlikView will need to hold the data as well as the hard disk space needed to store the QVW and QVD files. Since Qlikview is an in momry associative technology, you're limited only by the number of records you can fit in memory.
The limitations of the application are restricted by the physical capacity of the server running the document, the complexity of that document, and the density of the data.

If the report loads sales order data, for example, and the data contains customer name then there may be many duplicate values in the data for any customer who has ordered many times. QV will only store the customer name once and then keep track of where that customer name is used again as subsequent rows are loaded. 

You can even use this fact to your advantage when dealing with large volumes of data in QlikView. Breaking down a field into several fields can cause more repetition and thus reduce the amount of resources QV needs to hold the data both in RAM and on disk. Email addresses are prime candidates for this technique as the domain part will likely not be unique to one entry. Take this example:

Email Address

abcd@qlik.com
xyz@qlik.com
lmn@qlik.com


QlikView will consider each of these values as unique and thus will be storing the "@qlik.com" part repeatedly. If you were to split the email addresses at the "@” then the domain part becomes the same for all 3 records and thus QV will store it only once. You can split the email address using the following lines within a load statement:

....
Left (Email,index(Email,'@')-1) AS EmailName,
Right (Email, len(Email)-index(Email, '@')) AS EmailDomain
....

When wanting to display the email address in a chart you only have to concatenate the 2 fields together remembering to replace the '@' in the middle like this:

EmailName & '@' & EmailDomain

The same technique can be used for mail addresses, phone numbers, part codes and any other patterned, repetitive data. 

No comments:

Creating DataFrames from CSV in Apache Spark

 from pyspark.sql import SparkSession spark = SparkSession.builder.appName("CSV Example").getOrCreate() sc = spark.sparkContext Sp...