Mainframes 360
The one stop destination for System Z professionals

Friday, March 7, 2014

DB2 Basics : Indexes

DB2 Basics : Indexes

An index is a data-structure that exists physically on the disk. The index is usually defined on one or more columns of a table. It has the ordered column and pointers to rows in the table. Indexes are the fastest way to access DB2 data. Indexes reduce search-time.

Fig. Searching a value k in the Index

An index is essentially a B+ Tree. It has a root page, internal nodes and leaf pages. Each node has records containing two fields - the key and pointer(ROWID) to other nodes. The leaf nodes have no children; instead they house keys and pointers to data pages. B+ Trees are used in most filesystems and relational database products like IBM DB2, MS-SQL Server, Oracle, Sybase etc. to create indexes on data for efficient retrieval.

Let's say we are searching for a value k in the B+ Tree. Starting at the root page, we'll walk our way through the index and reach a leaf containing the value k(assume k=28). At each node, we determine which internal pointer we should follow. An internal B+ Tree node has at most b(in the above example b=3) children.

Imagine every node represent a range e.g. the root node represents 1<=key<=100, the blue node stands for 1<=key<=20, the yellow 21<=key<=40 and so forth. We compare k with the root node records. If k < 20, search the blue node. If 21<=k<=40, search the yellow node. If 41<=k<=60, search the green node. We follow the pointer to the yellow-node and the above steps applied recursively. Searching a tree with 3 leaf nodes takes 1 comparision, with 9 leaf nodes, takes 2 comparisions, with 27 leaf nodes takes 3 comparisions, with n leaf nodes requires log3 n comparisions. An index with b branches would require logb n comparisions. Searching a value in index requires logarithmic time.

Inserts and Page Splits

On index creation, LOADs and REORGs, DB2 reserves empty space for future INSERTs on every leaf page. PCTFREE, a percentage of the page-size reserves space for row overflows on all leaf pages. Say, PCTFREE=10 of a 4K page books 400 bytes on every leaf. As time goes by, INSERTs add entries whereas DELETE pseudo-deletes the index entries and leaves holes. A few index holes are good, into which newly inserted rows can go. When an INSERT happens, DB2 finds the candidate page affected. Based on the newly INSERTed row's key, an index hole may be reused, otherwise its added to the overflow area. If there isn't any room for an additional row to fit, DB2 will split the page into two equal groups. We call this a page split.

Clustering sequence and CLUSTERRATIO

A clustered index informs DB2 of the physical order(according to index key values) in which to arrange the rows of a table. On REORG, DB2 would sort the rows in an increasing order. Think about it! If you define a clustered index on DEPT_NO column of EMPLOYEES table, DB2 would have all the rows for the same department close together. Fewer reads would be required to access all rows for a given department.

Indexing is an art that takes time to master. Here are a few guidelines on creating an index :
  • Consider indexing on columns used in WHERE, GROUP BY, ORDER BY UNION and DISTINCT.
  • When you index a table, explicitly specify CLUSTER option on CREATE INDEX. Failure to do so, causes DB2 to cluster the data on the first index defined.
  • The first column of a multi-column index must be wisely chosen. I am using this analogy, where frequency can be thought of as temperature. Rarely accessed rows are called cold rows. Say, a CUSTOMERS_HISTORY database is not date ordered(UPD_TS is not indexed), and old rows are not deleted. To begin with, the hot rows can be spread out over two to three pages. Over time, the average page would contain fewer and fewer hot rows and more cold rows. As application programs are requesting the hot rows, these set of rows are now spread out over ten to twelve pages, instead of two to three. Voila! The number of GETPAGEs go up and so does the application CPU time. Therefore, choose the most referenced column in SQL as the first index column.

CLUSTERRATIO is the fraction of data rows in clustering sequence. If you do a RUNSTATS and see a CLUSTERRATIO of less than 96%, it implies there are performance problems and you should REORG.

INSYNC EDIT SYSIBM.SYSINDEXES                            ROW 1 OF 6 COLS 1 - 72 
COMMAND ===> SCROLL ===> CSR
NAME TBNAME CLUSTERRATIO FIRSTKEYCARD FULLKEYCARD NLEAF
CHAR(8) CHAR(10) SMALLINT INTEGER INTEGER INTEGER
-------- ---------- ------------ ------------ ----------- ---------
****** ************************** TOP OF DATA *********************************

000001 IDXEMP01 EMPLOYEES 100 171 171 2
000002 IDXEMP02 EMPLOYEES 100 92 167 11
000003 IDXEMP03 EMPLOYEES 100 25 171 8
000004 IDXEMP04 EMPLOYEES 100 171 171 2
000005 IDXEMP05 EMPLOYEES 100 4 167 14
000006 IDXEMP06 EMPLOYEES 100 25 31 5
****** ************************* BOTTOM OF DATA *******************************









F1=HELP F2=SPLIT F3=END F4=RETURN F5=RFIND F6=RCHANGE F7=UP F8=DOWN F9=SWAP
F10=LEFT F11=RIGHT F12=RETRIEVE

Fig. CLUSTERRATIO in SYSIBM.SYSINDEXES

To many people who are thrown to work at a mainframe computer on their first job, they feel lost. Mainframe people seem to speak a completely different language and that doesn't make life easy. What's more, the books and manuals are incredibly hard to comprehend.

"What on earth is a Mainframe?" is an absolute beginner's guide to mainframe computers. We'll introduce you to the hardware and peripherals. We'll talk about the operating system, the software installed on a mainframe. We'll also talk about the different people who work on a mainframe. In a nutshell, we'll de-mystify the mainframe.

Readers based in India, can buy the e-book for Rs. 50 only or the print book. International readers based in the US and other countries can click here to purchase the e-book.