CS346 - Spring 2011
Database System Implementation

RedBase Part 2: The Indexing Component Due Sunday April 24

Introduction

The second part of the RedBase system you will implement is the Indexing (IX) component. The IX component provides classes and methods for managing persistent indexes over unordered data records stored in paged files. Each data file may have any number of (single-attribute) indexes associated with it. The indexes ultimately will be used to speed up processing of relational selections, joins, and condition-based update and delete operations. Like the data records themselves, the indexes are stored in paged files. Hence, in implementing the IX component you will use the PF component similarly to the way you used it for Part 1. In the overall RedBase architecture, you can think of the IX and RM components as sitting side by side above the PF component.

The indexing technique you will implement in the IX component is B+ trees. B+ trees will be reviewed in class, they were covered in CS245, and they are discussed in detail in most comprehensive database textbooks. Because a "perfect" implementation of B+ trees turn out to be quite complex, we are allowing some simplifications as discussed in the Implementation Details section below.

All class names, return codes, constants, etc. in this component should begin with the prefix IX. Each B+ tree index can be stored in one paged file from the PF component. Some specific implementation suggestions are given later in this document, but you should be aware that we're giving away fewer details for this component than we did for the RM component.

Note: You can certainly find pseudocode and perhaps even software packages for B+ trees available publicly. In fact, a previous CS346 TA, based on his work for the class, wrote a paper specifying B+ tree deletion algorithms (pdf). You are welcome to use anything you find, as long as you provide proper acknowledgment when you turn in this part of the project. However, we do warn against simply copying available code and then trying to modify it to fit the RedBase specification. That approach is very likely to be more difficult than using available code or pseudocode for algorithmic ideas and reference.

IX Interface

The IX interface you will implement consists of three classes: the IX_Manager class, the IX_IndexHandle class, and the IX_IndexScan class. In addition, there is an IX_PrintError routine for printing messages associated with nonzero IX return codes. To obtain an initial header file with the public method declarations for this interface (along with links for some other files mentioned below) run the setup script described in the RedBase Logistics document with argument "2" (for project part 2). As usual, all IX component public methods (except constructors and destructors) should return 0 if they complete normally and a nonzero return code otherwise.

IX_Manager Class

The IX_Manager class handles the creation, deletion, opening, and closing of indexes. Your program should create exactly one instance of this class. All necessary initialization of the IX component should take place within the constructor for the IX_Manager class. Note that this constructor takes as a parameter the instance of the PF_Manager class, which you should already have created (refer to the PF and RM documents). Any necessary clean-up in the IX component should take place within the destructor for the IX_Manager class.

class IX_Manager {
  public:
       IX_Manager   (PF_Manager &pfm);              // Constructor
       ~IX_Manager  ();                             // Destructor
    RC CreateIndex  (const char *fileName,          // Create new index
                     int        indexNo,
                     AttrType   attrType,
                     int        attrLength);
    RC DestroyIndex (const char *fileName,          // Destroy index
                     int        indexNo);
    RC OpenIndex    (const char *fileName,          // Open index
                     int        indexNo,
                     IX_IndexHandle &indexHandle);
    RC CloseIndex   (IX_IndexHandle &indexHandle);  // Close index
};

RC CreateIndex (const char *fileName, int indexNo, AttrType attrType, int attrLength)

This method creates an index numbered indexNo on the data file named fileName. You may assume that clients of this method will ensure that the indexNo parameter is unique and nonnegative for each index created on a file. Thus, indexNo can be used along with fileName to generate a unique file name (e.g., "fileName.indexNo") that you can use for the PF component file storing the new index. The type and length of the attribute being indexed are described by parameters attrType and attrLength, respectively. As in the RM component, attrLength should be 4 for attribute types INT or FLOAT, and it should be between 1 and MAXSTRINGLEN for attribute type STRING. This method should establish an empty index by creating the PF component file and initializing it appropriately.

RC DestroyIndex (const char *fileName, int indexNo)

This method should destroy the index numbered indexNo on the data file named fileName by destroying the PF component file used to store the index.

RC OpenIndex (const char *fileName, int indexNo, IX_IndexHandle &indexHandle)

This method should open the index numbered indexNo on the data file named fileName by opening the PF component file used to store the index. If the method is successful, the indexHandle object should become a handle for the open index. The index handle is used to insert into and delete entries from the index (see the IX_IndexHandle methods below), and it can be passed into an IX_IndexScan constructor (see below) for performing a scan using the index. As with RM component files, clients should be able to open an index more than once for reading using a different indexHandle object each time. However, you may make the assumption (without checking it) that if a client is modifying an index, then no other clients are using an indexHandle to read or modify that index.

RC CloseIndex (IX_IndexHandle &indexHandle)

This method should close the open index referred to by indexHandle by closing the PF component file used to store the index.

IX_IndexHandle Class

The IX_IndexHandle class is used to insert and delete index entries, and to force pages of an index's files to disk. To perform these operations, a client first creates an instance of this class and passes it to the IX_Manager::OpenIndex method described above.

class IX_IndexHandle {
  public:
       IX_IndexHandle  ();                             // Constructor
       ~IX_IndexHandle ();                             // Destructor
    RC InsertEntry     (void *pData, const RID &rid);  // Insert new index entry
    RC DeleteEntry     (void *pData, const RID &rid);  // Delete index entry
    RC ForcePages      ();                             // Copy index to disk
 };

RC InsertEntry (void *pData, const RID &rid)

For this and the following two methods, it is incorrect if the IX_IndexHandle object for which the method is called does not refer to an open index. This method should insert a new entry into the index associated with IX_IndexHandle. Parameter pData points to the attribute value to be inserted into the index, and parameter rid identifies the record with that value to be added to the index. Hence, this method effectively inserts an entry for the pair (*pData,rid) into the index. (The index should contain only the record's RID, not the record itself.) If the indexed attribute is a character string of length n, then you may assume that *pData is exactly n bytes long; similarly for parameter *pData in the next method. This method should return a nonzero code if there is already an entry for (*pData,rid) in the index.

RC DeleteEntry (void *pData, const RID &rid)

This method should delete the entry for the (*pData,rid) pair from the index associated with IX_IndexHandle. Although clients of the IX Component typically will ensure that DeleteEntry is not called for entries that are not in the index, for debugging purposes you may want to return a (positive) error code if such a call is made.

RC ForcePages ()

This method should copy to disk all pages associated with the IX_IndexHandle. The index page contents are forced to disk by calling PF_FileHandle::ForcePages for the index file.

IX_IndexScan Class

The IX_IndexScan class is used to perform condition-based scans over the entries of an index.

class IX_IndexScan {
  public:
       IX_IndexScan  ();                                 // Constructor
       ~IX_IndexScan ();                                 // Destructor
    RC OpenScan      (const IX_IndexHandle &indexHandle, // Initialize index scan
                      CompOp      compOp,
                      void        *value,
                      ClientHint  pinHint = NO_HINT);           
    RC GetNextEntry  (RID &rid);                         // Get next matching entry
    RC CloseScan     ();                                 // Terminate index scan
 };

RC OpenScan (const IX_IndexHandle &indexHandle, CompOp compOp, void *value, ClientHint pinHint = NO_HINT)

This method should initialize a condition-based scan over the entries in the open index referred to by parameter indexHandle. Once underway, the scan should produce the RIDs of all records whose indexed attribute value compares in the specified way with the specified value. Parameters compOp and value are exactly as in the RM_FileScan::OpenScan method (including the possibility that compOp=NO_OP and value is a null pointer, indicating a complete scan); please refer to the RM Component document for details. The only exception is that for B+ tree scans, you may choose to disallow comparison operator NE_OP (not-equal). You will need to cast parameter value into the appropriate type for the attribute (or, in the case of an integer or float, copy it into a separate variable to avoid alignment problems), as in the RM component. Also as in method IX_IndexHandle::InsertEntry, if the indexed attribute is a character string of length n, then you may assume that value is exactly n bytes long. As in RM component file scans, optional parameter pinHint is included so that higher-level RedBase components using an IX component index scan can suggest a specific page-pinning strategy for the IX component to use during the index scan, to achieve maximum efficiency. Exploiting this parameter, either now or later, is entirely optional.

RC GetNextEntry (RID &rid)

This method should set output parameter rid to be the RID of the next record in the index scan. This method should return IX_EOF (a positive return code that you define) if there are no index entries left satisfying the scan condition. You may assume that IX component clients will not close the corresponding open index while a scan is underway.

RC CloseScan ()

This method should terminate the index scan.

IX_PrintError

void IX_PrintError (RC rc);

This routine should write a message associated with the nonzero IX return code rc onto the Unix stderr output stream. This routine has no return value.

Implementation Details

As with Part 1, you are free to use alternative design ideas to those suggested here if you believe your ideas will improve the structure or performance of your code. The only thing that you must not alter is the interface itself, although you are free to extend it either now or as the project progresses.
Each node of a B+ tree can be stored in one page of the corresponding PF file. Logically, index entries have the form (attrValue,rid), indicating that in the data file being indexed there is a record with RID rid whose value for the indexed attribute is attrValue. In one straightforward design the leaves of the tree are structurally identical to internal nodes, i.e., they contain attribute values and page pointers (numbers), and there is a separate "bucket" page (pointed to from the leaf nodes) for each attribute value, containing the list of RIDs for that value. Note that although this straightforward design is certainly acceptable for your project, it is very inefficient when the number of RIDs for each attribute value is low -- buckets will be nearly empty. There are numerous improvements or alternatives to this straightforward design.
- Simplification: You do not need to handle the case where the number of RIDs for a single attribute value is so large that all RIDs for that value cannot be stored on one page, but you should detect if such an overflow occurs and generate a nonzero return code.
- Small amount of extra credit: Accommodate any number of RIDs for an attribute value, typically through bucket chaining.
The three fundamental B+ tree operations -- search (which extends to scan), insertion, and deletion -- vary quite a bit in their implementation complexity. We suggest that you get search and insertion running first, and then worry about deletion. In fact, implementing a completely correct delete operation in B+ trees turns out to be quite difficult.
- Simplification: You may implement lazy deletion. In this approach, when an entry is deleted, even if it causes a leaf page to become less than half full no redistribution or node merging takes place -- the underfull page remains in the tree. When a leaf page becomes empty the node is removed from the tree and an entry is removed from its parent, but again no redistribution or node merging takes place.
- Small amount of credit deducted: Even simpler than lazy deletion is tombstones. In this approach, when an entry is deleted it is replaced by a special marker indicating an empty slot (which may be reused later). Tree nodes are never deleted or merged, although empty buckets should be deleted. You cannot receive full credit on the IX component if you implement tombstones, but it is an option to consider if you're pressed for time.
- Medium amount of extra credit: Implementing fully correct (i.e., rebalancing) deletion is quite complex, with a number of tricky end cases. We will be duly impressed if you do so.
Regardless of which approach you use, deletion must work: once an IX component client asks for an entry to be deleted, that entry should never appear in a subsequent index scan.
We strongly suggest that you implement your B+ tree operations using recursive algorithms. Although you may find it somewhat difficult to understand these algorithms at first, using a recursive approach will greatly simplify your coding task.
Index scans will be used by higher-level components when executing selection, join, and delete operations, as well as update operations on attributes other than the indexed attribute. Thus, each index entry scanned will either be used to fetch (and possibly update) a record, or to delete a record. (The insert operation inserts one record at a time.) While making an index scan work correctly for selection, join, and non-index-key update operations is relatively straightforward, deletion operations are more complicated, even when using the simplified approaches to deletion described above. You must ensure that it is possible to use an index scan to find and then delete all index entries satisfying a condition. That is, the following client code segment should work:
```
   IX_IndexScan scan;
   scan.OpenScan(indexHandle, ...)
   while ((rc = scan.GetNextEntry(rid)) != IX_EOF) {
      error checking;
      delete record;
      // attrValue is value of indexed attribute
      indexHandle.DeleteEntry(attrValue, rid); }
```
Depending on your design, which simplifications you make, and how you manage deletions, making sure this code will work may require a varying amount of effort. However, it will enable your query processor (Part 4) to efficiently perform delete operations based on indexed attributes.
You may assume that during a retrieval scan (for selection, join, and update operations) no index entries will be inserted or deleted. You also may assume that during a deletion scan, no other index records will be inserted or deleted, and no retrieval scans will be underway. In the basic RedBase system, IX clients will ensure that these types of conflicting operations/scans can never occur.
Depending on your design, it may require some extra effort for scans to always return RIDs such that the corresponding attribute values are in increasing (actually nondecreasing) order. This property is not required in the project, however you may find it useful later on, e.g., if you decide to use index scans to produce a sorted relation. Note that one of the provided tests does exploit this property, but passing this test is not required.

Miscellaneous

As with the RM component, for the IX component you may find it convenient to use an internal header file (ix_internal.h, say), along with the external header file ix.h and the global header file redbase.h.
Return codes and error handling should continue in the style you adopted for the RM component.
You are again expected to include comments in your code, and to submit a 1-2 page description (in a plain text file ix_DOC) covering your design, key data structures, testing strategy, and known bugs, and citing any assistance you received in design or debugging. Since the specification and implementation suggestions are somewhat looser for this component than they were for Part 1, please be sure to explain any high-level algorithmic decisions you needed to make.
As in the RM component, after you submit your project part you will receive an email message from the TA asking you to describe your design decisions and implementation strategy for specific parts of the IX component. You will be expected to respond within 48 hours with a short message and pointers to the relevant code.
Running the setup script for this component copies an ix_test.cc program, but as always the tests are not comprehensive and you will need to create more tests in order to thoroughly exercise your code. We will use the same testing shell with our (surprise) tests in order to check your component for correctness. IX component tests from students in previous years are in the test repository /usr/class/cs346/redbase/testers/. Note that some of these tests were designed for hash indexes, which only support the equality comparison operator, so the interface for IX_IndexScan::OpenScan is somewhat different and those tests would need to be modified for this year's project.
If you are aiming to win the RedBase Efficiency Contest then it is again very important to consider (and measure) I/O efficiency as you develop your code.
Part 2 should be submitted electronically in the same way you submitted Part 1. Please compile using the -DPF_STATS flag, remember to run the "submit -c" script with argument "2" (for project part 2), and be sure to check the file submit.ix before issuing the final "submit -s 2" command. Similar to Part 1, the TA test program will link with libix.a which should be the result of running "make" from within your submission directory, and everything must link correctly if the test program calls methods in ix.h.