April 15, 2009

SQL Server - Fragmentation

Even though SQL Server is very good at more or less automatically maintaining most statistics, physical data structures, configuration settings and so on (especially compared to some of its competitors), there are still situations where you need to perform a little manual work to help it. One of those areas is fragmentation.

Different types of fragmentation
Fragmentation occurs as data is modified. There are two major kinds of fragmentation to distinguish between: disk fragmentation and SQL Server fragmentation.

Logical disk fragmentation:
The database files can be logically fragmented in the file system just like any other operating system files. This occurs when the file system cannot create the file in one logically contiguous space. The result is that the disk head must move back and forth over the disk when reading from the file. To remove this type of fragmentation you can use the defragmentation tools included in Windows or some third-party application. Remember though that SQL Server must be stopped during the defragmentation process, because once it is started the database files are in use by SQL Server, even if no user is actually using that specific database, and files in use cannot be defragmented.

Avoiding fragmentation of database files:
To keep fragmentation in the database files at a minimum without defragmenting them, you should avoid relying on the autogrow setting in SQL Server. Preallocate enough space for the files from the start, and when growth is still needed, increase the size in large chunks at once. Also make sure you keep the database files on separate disks, away from other files, so that logically contiguous space is available for them. Separating the data files from the log file is helpful, and also important for performance in other ways, since the log file is written to sequentially.
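As a rough sketch of what this can look like (the database name, file paths and sizes below are hypothetical, so adjust them to your environment):

    -- Preallocate generous file sizes up front so autogrow rarely kicks in,
    -- and keep the data and log files on separate disks.
    CREATE DATABASE Sales
    ON PRIMARY
    (
        NAME = Sales_Data,
        FILENAME = 'D:\SQLData\Sales_Data.mdf',
        SIZE = 10240MB,       -- preallocated data file
        FILEGROWTH = 1024MB   -- if growth is still needed, grow in large chunks
    )
    LOG ON
    (
        NAME = Sales_Log,
        FILENAME = 'E:\SQLLogs\Sales_Log.ldf',
        SIZE = 2048MB,
        FILEGROWTH = 512MB
    );

    -- When more space is needed, grow manually in one large chunk
    -- instead of relying on many small autogrow events.
    ALTER DATABASE Sales
    MODIFY FILE (NAME = Sales_Data, SIZE = 20480MB);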

SQL Server index fragmentation:
The other major type of fragmentation, and the one that is the most interesting to us, is SQL Server fragmentation. This means that the data stored inside the database files is fragmented on an index level basis, regardless of whether the database file is logically fragmented or not. The easiest way to describe it in a couple of words is to say that it prevents SQL Server from using indexes in an optimal way. Index fragmentation can be either internal or external. Note that external fragmentation is not the same thing as logical disk fragmentation, though the name might imply it. Each of these types is described in detail below.
Internal fragmentation
Pages that have a lot of free space are said to be internally fragmented. This is caused by rows being removed by DELETE statements, or by page splits that leave pages only about half full. Empty space on pages means fewer rows per page, which in turn means more page reads. Systems that are read-intensive can see significant performance degradation from these extra reads. Even for applications defined as OLTP (OnLine Transaction Processing) systems, i.e. write-intensive systems, the ratio between reads and writes is usually more than 80/20.
Sometimes necessary and wanted:
However, page splits are very expensive operations. Sometimes the tradeoff of extra page reads during read operations in exchange for fewer page splits during write operations is worth making. For systems that really are write-intensive, or for specific operations that generate many writes, this might be necessary. There is no universally correct advice here, as the right balance differs completely between applications and systems.
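As a minimal sketch (the table and index names are hypothetical), free space can be reserved on purpose by specifying a fill factor when the index is built:

    -- Leave roughly 20% free space on each leaf page, postponing
    -- page splits at the cost of some extra page reads.
    CREATE INDEX IX_Orders_CustomerId
    ON dbo.Orders (CustomerId)
    WITH FILLFACTOR = 80;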

External fragmentation:
When tables grow as new data is inserted into them, the ideal situation is that pages get allocated to an extent, then when that extent is filled up a new extent is allocated and filled with pages, and so on. This way all the pages will always be located adjacent to each other, and reading eight pages will in the best case only require reading one extent from disk.

Reality, however, is seldom ideal. New rows, when inserted, are bound to cause page splits sooner or later, as will updates that cause variable-sized rows to grow too large to fit on the original page. When a new page is created from a page split it will be allocated in the same extent as the original page if there is room for it. If the extent is already full, which will normally be the case in a production system, a new extent will be allocated to the index and the new page will be placed there. The new page will not be contiguous to the original page, so when following the page chain using the previous and next pointers in the page headers, SQL Server will need to perform an extra extent switch. In a worst case scenario, reading eight pages in order might require seven extent switches, and eight extents must be read from disk.

Ordered scans are affected:
When reading individual rows, external fragmentation will not affect performance at all, and the same goes for unordered scans. The reason for the latter is that unordered scans can use the IAM pages to find which extents need to be fetched in a very efficient way. It is only for scans that need to fetch the pages in logical order that the page chain is used, and there external fragmentation might become a degrading factor for performance.

Viewing fragmentation information:
To view how much fragmentation an index has you use DBCC SHOWCONTIG. It can show information for all indexes on a table or just a single index at a time. Using the option TABLERESULTS you get extra columns in the output that describe statistics about the index. By default DBCC SHOWCONTIG scans the page chain at the leaf level, but it is possible to scan all levels of an index.
When reviewing the output from DBCC SHOWCONTIG you should pay special attention to the following metrics (a sketch of typical invocations follows the list):
· Avg. Page Density (full): Shows how full the pages are on average. A high percentage means the pages are almost full, while a low percentage indicates a lot of free space. This value should be compared to the fill factor setting specified when the index was created to decide whether or not the index is internally fragmented.
· Scan Density: Shows the ratio between the Best Count of extents that should be necessary to read when scanning all the pages of the index, and the Actual Count of extents that were read. This percentage should be as close to 100% as possible. Defining an acceptable level is difficult, but anything under 75% definitely indicates external fragmentation.
· Logical Scan Fragmentation: Shows the ratio of pages that are out of logical order. The value should be as close to 0% as possible and anything over 10% indicates external fragmentation.
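A minimal sketch of typical invocations, reusing the same hypothetical table and index names as above:

    -- Fragmentation info for the table (its heap or clustered index).
    DBCC SHOWCONTIG ('dbo.Orders');

    -- All indexes on the table.
    DBCC SHOWCONTIG ('dbo.Orders') WITH ALL_INDEXES;

    -- A single index, with the extra statistics columns (TABLERESULTS)
    -- and all levels of the index scanned instead of just the leaf level.
    DBCC SHOWCONTIG ('dbo.Orders', 'IX_Orders_CustomerId')
        WITH TABLERESULTS, ALL_LEVELS;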

Two things to know about DBCC SHOWCONTIG's output and the numbers in Scan Density, Logical Scan Fragmentation and Extent Scan Fragmentation:
1) If your indexes are small, then these values do not matter at all. Why? Because SQL Server has two types of extents: uniform extents and mixed extents. Uniform extents are owned by a single object, whereas mixed extents are shared by several smaller objects. In the case of a uniform extent, all eight pages of the extent (an extent = 8 pages = 64KB) are allocated to one single object. That is why for smaller objects you will see a low scan density and a high extent scan fragmentation, since the object might be using, say, a single page in each of N extents. If the objects are large, then SQL Server will be using uniform extents and you will see a scan density closer to 100% if there is no fragmentation.
2) Another thing to note is that the extent scan fragmentation shown in the output of DBCC SHOWCONTIG does not represent the true value if the index belongs to a filegroup with multiple files. This is also documented in BOL.

Defragging:
There are four different ways to defragment an index (example statements for each follow the list).
  • Drop and recreate index. Pros: Completely rebuilds the index. Reorders and compacts the index pages, and removes unnecessary pages. Cons: The index is not available during the process, which means that queries will suffer. The process will block all queries accessing the index, and will itself be blocked by other processes using the index, unable to do its work until those processes are finished. This option is especially bad for clustered indexes, as dropping a clustered index means that all non-clustered indexes must be rebuilt (to use the RID as bookmark), and then recreating the clustered index will once again force a rebuild of all non-clustered indexes.
  • CREATE INDEX ... WITH DROP_EXISTING. Pros: Rebuilds the index in one step, which is good for clustered indexes, as it means that non-clustered indexes do not need to be rebuilt. Can be used to recreate (and thereby defragment) indexes created by constraints, if the index definition matches the requirements of the constraint. Cons: Potential blocking problems with other processes, in the same way as drop and recreate index above.
  • DBCC DBREINDEX. Pros: Does the best job of removing internal and external fragmentation, since it physically assigns new pages to the index and moves data to them. Can rebuild all indexes on a table in one statement. Recreates indexes for constraints without forcing you to know their requirements. Cons: The same blocking issues as for the two options above. Runs in a transaction, so all work that has been done is rolled back if the command is stopped.
  • DBCC INDEXDEFRAG. Pros: Reorders the leaf pages in index order (enhancing performance especially for scans) and compacts them using the fill factor specified when the index was created. Empty pages are removed. This is an online operation, meaning it does not take long-term locks, and thereby does not block other processes and does not get blocked by them. Work that has already been done will remain done if the command is cancelled. Cons: The fact that this is an online operation can also have a negative impact, as it means that DBCC INDEXDEFRAG will simply skip pages that are used by other processes. Very fragmented indexes will take a long time to defragment, so they should probably be rebuilt instead if possible. It also generates a lot of log activity.
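Hedged sketches of what each option can look like, reusing the same hypothetical database, table and index names as above:

    -- 1) Drop and recreate (the index is unavailable in between).
    DROP INDEX dbo.Orders.IX_Orders_CustomerId;
    CREATE INDEX IX_Orders_CustomerId ON dbo.Orders (CustomerId);

    -- 2) Rebuild in one step, keeping or changing the definition.
    CREATE INDEX IX_Orders_CustomerId
    ON dbo.Orders (CustomerId)
    WITH DROP_EXISTING, FILLFACTOR = 80;

    -- 3) Rebuild all indexes on the table ('' = all indexes)
    --    with a fill factor of 80.
    DBCC DBREINDEX ('dbo.Orders', '', 80);

    -- 4) Online defragmentation of the leaf level of a single index
    --    (arguments: database, table, index).
    DBCC INDEXDEFRAG ('Sales', 'dbo.Orders', 'IX_Orders_CustomerId');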
