Online Reorg

Introduction

Online Reorg is an availability enhancement to DB2 introduced in DB2 Version 5. It increases availability of a tablespace and/or index while it is being reorganised.

The reorganisation of an object necessarily consists of the destruction of an object and its rebuilding in reorganised form. Because of this, the data in the object cannot be available to applications throughout the time of the reorg. Online reorg attempts to reduce the period during which the data is unavailable.

Prior to Online Reorg

Traditionally, tablespace reorg consists of the following steps:

unload of the data
redefinition of the tablespace
reload of the data
rebuild of the indexes

The following diagram shows the steps which take place in a reorg:

During the unload, other users of the system can read the tablespace. However they cannot update it because they might update parts of the tablespace which have already been unloaded, so their updates would be lost.

During the redefinition, reload and rebuild, the data is either absent or incomplete - users must be prevented from reading as well as updating as they will get inconsistent data.

Online Reorg

Online reorg works by reloading the reorged tablespace and rebuilding the indexes into different datasets (shadow tablespaces and indexes). It then switches the original datasets and the new ones at the end of the reorg - the idea is that the objects are only unavailable for this short time at the end, not during the reload and rebuild phases which can be protracted for large objects.

Online reorg comes in two flavours - one which allows applications to read and update during the reorg, and one which allows only reading.

SHRLEVEL

A new keyword, SHRLEVEL, was introduced in DB2 Version 5 to control the invocation of online reorg. It has three settings:

SHRLEVEL NONE Same as 'traditional' reorg, i.e. online reorg is not invoked.

SHRLEVEL CHANGE Online reorg allows both readers and writers

SHRLEVEL REFERENCE Online reorg allows only readers

How Does it Work?

The following diagram shows what happens during an online reorg:

Differences for Partitioned Tablespaces

If a single partition of a partitioned tablespace is reorganised (which is the general recommendation), the reorg of the tablespace and index partitions proceeds as for a non-partitioned tablespace. If there are non-partitioning indexes, they need to be processed differently. A non-partitioning index has entries which relate to rows in all the partitions of the tablespace. It is not processed by the shadow copy method, so the entries which relate to the target partition have to be updated separately. This is known as the BUILD2 phase. However, shadow copies are created for the NPIs, but these are just temporary holding areas and are deleted afterwards. Note also that the NPI updating causes logging to take place (this is whether using online reorg or non-online).

Online Reorg Phases

The phases of the reorg overlap in places, reducing elapsed time:

UNLOAD

Rows are unloaded from the tablespace. As they are unloaded, they are passed to a parallel task for sorting in memory. When the rows are sorted, they are output to the RELOAD phase.

RELOAD

The reload phase loads the rows into the shadow tablespace as they are passed from the sortout part of the unload phase. Keys for the indexes are passed to the sort phase. The mapping table is built, and a full image copy is written.

SORT

When all the keys have been received from the RELOAD phase, they are sorted and then passed to the BUILD phase.

BUILD

The indexes are built from the sorted keys. In V6, SORT and BUILD are combined into one SORTBUILD phase. Multiple indexes can be sorted and built in parallel.

LOG

Updates which have been made to the original tablespace and indexes must now be made to the shadow ones. Changes from log records are applied to the shadow tablespace rows and index entries. The mapping table is used to find the new location of the rows. This process happens iteratively - during each iteration, log records are applied to the shadow objects, and more changes may be made to the old objects. When reorg estimates that the duration of the next iteration will be sufficiently small, it drains writers so that no further updates take place and does the last iteration.

SWITCH

All readers and writers are drained. The old objects and shadow objects are renamed so that the shadow objects become the new tablespaces and indexes.

BUILD2

This only happens when reorging a partition of a partitioned tablespace. The index entries in the non-partitioned indexes which point to the reorged partition are updated so that their RIDs point to the correct place.

The Mapping Table

Each row in the tablespace has a row identifier (RID). This is basically the address of the row - it is a concatenation of the row's partition plus page number plus location in the page. When a row is loaded into the shadow tablespace, its position, and therefore its RID, will have changed. The mapping table holds an entry for each row in the tablespace showing the correspondence between its old RID and its new one. This is so that the updates which have happened to the old tablespace rows during the reorg can be applied to the new tablespace rows during the LOG phase.

Enforced Keywords

Some keywords are always used for online reorg whether specified or not.

LOG NO

LOG NO is always used for online reorg (LOG YES cannot be specified). The inserts are happening to the shadow tablespace, not the 'real' tablespace which may still be being updated.

NOSYSREC

A SYSREC dataset is not needed because the rows are unloaded into the shadow tablespace.

COPYDDN

An image copy must be taken because LOG NO is enforced and the tablespace would be left in COPY PENDING status otherwise.

SORTKEYS

Extra Keywords

MAXRO

The maximum length of time the last iteration of log apply should take. Reorg maintains statistics as it iterates to estimate how long the next iteration should take. Once it determines the next iteration will take less than the MAXRO time (default 300 seconds), it prepares for the last iteration by draining writers and/or readers.

LONGLOG CONTINUE|TERM|DRAIN

If each iteration of the log apply is not sufficiently shorter than the last, reorg will first send a message to the console, then keep going CONTINUE, term the utility TERM, or drain all writers DRAIN.

MAPPING TABLE

Specifies the name of the mapping table for SHRLEVEL CHANGE.

DELAY

How long reorg will wait between sending the LONGLOG message, and taking the LONGLOG action. The default is 1200 seconds.

TIMEOUT ABEND|TERM

Specifies what reorg will do if it gets a timeout while trying to drain during the LOG or SWITCH phases. ABEND will leave the objects in a restricted status, while TERM will terminate the utility.

DEADLINE

Deadline specifies the time by which the reorg should have finished. The parameter can be a time or date, and can be specified as an absolute value, or relative to current date or time.

DRAIN

Online reorg drains writers at the start of the last iteration of log processing. Sometimes if readers are allowed, a kind of undetected deadlock between reorg and a reader can occur, leading to timeouts. In this case, use DRAIN ALL to drain readers as well.

Application Concurrency

For most of the period of the reorg, updating (for SHRLEVEL CHANGE) and reading (for SHRLEVEL CHANGE or REFERENCE) is allowed. However, there are up to three periods which run consecutively which reduce or prohibit access.

During the last iteration of log apply for SHRLEVEL CHANGE, reorg must know that there is no further updating (so that this is indeed the last iteration). It therefore drains writers before the last iteration. It can optionally drain readers as well if the DRAIN keyword is used.

In the SWITCH phase, reorg renames the datasets so that the shadow tablespace and indexes become the new 'real' copies. No application process can have access during this phase.

Additionally, when a single partition of a partitioned tablespace is reorganised, rows in the partition are unavailable during the BUILD2 phase.

The result of this is that the utility must be able to get through these phases quickly enough so as not to impact application processes, and application processes must release locks with sufficient frequency so that the reorg does not time out trying to drain work.

Online Reorg is therefore best run at a time of low update and read activity.

Applications should commit frequently (even if only reading). Note that cursors declared WITH HOLD will hold locks across a commit so will not allow online reorg to complete a drain.

The bind parameter RELEASE COMMIT|DEALLOCATE does not seem to have any effect on this.

Plus Points

Availability

Availability is increased because updating or reading can take place during most of the reorg.

Also, if the reorg fails, the old tablespace and indexes still exist and do not need to be recovered.

Minus Points

Managing the mapping table and index

When SHRLEVEL CHANGE is used, a mapping table is required. This either needs to be created with a unique name for each reorg, or a set of mapping tables needs to be maintained and allocated to each reorg as it runs. The former seems to be the more flexible, but there is a possible issue of catalog contention as the table and index spaces are created and deleted (perhaps thousands a night). This can only be determined by stress-testing.

Extra Dasd

Because the original tablespace and indexes and the shadow ones coexist during the running of the reorg, there must be extra free space in the DASD pool at this time. The amount of space needed will depend on the number of concurrent reorgs and the size of the objects being reorganised.

This is a particular issue when reorging non-partitioning indexes (NPIs). Online reorg creates a non-persistent shadow dataset for the logical partition. The problem is that DB2 has no idea what the primary quantity of the index should be, so gives it the primary quantity of the whole NPI. Imagine you have an NPI with a primary quantity of two gig. If you run 10 online reorgs concurrently, each will allocate a 2 gig shadow for the NPI - a total of 20 gig for a 2 gig dataset.

Some space is also required for the mapping table's tablespace and indexes. Apparently the mapping table does not get populated - only the index does. The table can therefore have minimal allocation. The index size needs to be 32 bytes * (number of rows + overflow) where overflow is the sum of NEARINDREF and FARINDREF from the DB2 catalog table, SYSIBM.SYSTABLEPART (this is a count of how many variable length rows couldn't fit into their original location after being updated).

Larger Image Copies

An inline image copy is taken as the shadow tablespace is loaded. At the start of the last iteration of log processing, an incremental image copy is appended to it, and after the last log iteration, another incremental image copy is taken. The increase in size is small, but if there were substantial amounts of updates during the reorg, the image copy could be twice as big. This, however, is unlikely to be a problem.

BUILD2 phase for partition reorgs

As already mentioned, the BUILD2 phase causes data unavailability during the last stages of the reorg.

Increased Complexity

It just used to be a case of 'don't run other stuff during your reorgs'. Now things are more complicated - some stuff will run OK, some will run OK most of the time, some will cause problems. Online reorg compatibility needs to be taken into account during program coding.

V6 Changes

Automatic -TERM UTILITY

If REORG abends during the SWITCH or BUILD2 phases, objects are left in UTUT or UTRO status. A -TERM UTILITY command is required to clear this. A new keyphrase, TIMEOUT TERM, causes the utility to be automatically TERMED in thiese circumstances.We have this retrofitted into V5.

More Efficient Build of Mapping Table Index

In V5, the index on the mapping table is built during the UNLOAD and RELOAD phases using insert mode processing. In V6, the index is built during the BUILD phase using load mode processing. This happens in parallel with the build of the target indexes. The improvement in elapsed time is most marked for poorly organised tables.

Index Build Parallelism

As with all V6 reorgs, runtime is decreased due to index rebuilds being able to run in parallel.