CS346 - Spring 2011
Database System Implementation
RedBase Part 2: The Indexing Component
Due Sunday April 24
The second part of the RedBase system you will implement is the
Indexing (IX) component. The IX component provides
classes and methods for managing persistent indexes over unordered
data records stored in paged files. Each data file may have any
number of (single-attribute) indexes associated with it. The indexes
ultimately will be used to speed up processing of relational
selections, joins, and condition-based update and delete operations.
Like the data records themselves, the indexes are stored in paged
files. Hence, in implementing the IX component you will use the PF
component similarly to the way you used it for Part 1. In the overall
RedBase architecture, you can think of the IX and RM components as
sitting side by side above the PF component.
The indexing technique you will implement in the IX component is B+
trees. B+ trees will be reviewed in class; they were covered in
CS245 and are discussed in detail in most comprehensive database
textbooks. Because a "perfect" implementation of B+ trees turns out to
be quite complex, we are allowing some simplifications as discussed in
the Implementation Details section below.
All class names, return codes, constants, etc. in this component
should begin with the prefix IX. Each B+ tree index can be stored in
one paged file from the PF component. Some specific implementation
suggestions are given later in this document, but
you should be aware that we're giving away fewer details for this
component than we did for the RM component.
Note: You can certainly find pseudocode and perhaps even software
packages for B+ trees available publicly. In fact, a previous CS346
TA, based on his work for the class, wrote a paper specifying B+ tree deletion algorithms
(pdf). You are welcome to use anything you find, as long as you
provide proper acknowledgment when you turn in this part of the
project. However, we do warn against simply copying available code
and then trying to modify it to fit the RedBase specification. That
approach is very likely to be more difficult than using available code
or pseudocode for algorithmic ideas and reference.
The IX interface you will implement consists of three classes: the
IX_Manager class, the IX_IndexHandle class, and the
IX_IndexScan class. In addition, there is an
IX_PrintError routine for printing messages associated with
nonzero IX return codes. To obtain an initial header file with the
public method declarations for this interface (along with links for
some other files mentioned below) run the setup script
described in the RedBase
Logistics document with argument "2" (for project part
2). As usual, all IX component public methods (except constructors
and destructors) should return 0 if they complete normally and a
nonzero return code otherwise.
IX_Manager Class
The IX_Manager class handles the creation, deletion, opening,
and closing of indexes. Your program should create exactly one
instance of this class. All necessary initialization of the IX
component should take place within the constructor for the
IX_Manager class. Note that this constructor takes as a
parameter the instance of the PF_Manager class, which you
should already have created (refer to the PF and RM
documents). Any necessary clean-up in the IX component should take
place within the destructor for the IX_Manager class.
class IX_Manager {
  public:
       IX_Manager   (PF_Manager &pfm);              // Constructor
       ~IX_Manager  ();                             // Destructor
    RC CreateIndex  (const char *fileName,          // Create new index
                     int        indexNo,
                     AttrType   attrType,
                     int        attrLength);
    RC DestroyIndex (const char *fileName,          // Destroy index
                     int        indexNo);
    RC OpenIndex    (const char *fileName,          // Open index
                     int        indexNo,
                     IX_IndexHandle &indexHandle);
    RC CloseIndex   (IX_IndexHandle &indexHandle);  // Close index
};
RC CreateIndex (const char *fileName, int indexNo,
AttrType attrType, int attrLength)
This method creates an index numbered indexNo on the data
file named fileName. You may assume that clients of this
method will ensure that the indexNo parameter is unique and
nonnegative for each index created on a file. Thus, indexNo
can be used along with fileName to generate a unique file
name (e.g., "fileName.indexNo") that you can use for the PF
component file storing the new index. The type and length of the
attribute being indexed are described by parameters attrType
and attrLength, respectively. As in the RM component,
attrLength should be 4 for attribute types INT or
FLOAT, and it should be between 1 and MAXSTRINGLEN
for attribute type STRING. This method should establish an
empty index by creating the PF component file and initializing it
appropriately.
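As a minimal sketch of the naming scheme suggested above, the following helper builds a "fileName.indexNo" string. The name MakeIndexFileName is hypothetical and not part of the required IX interface; any scheme that yields a unique PF file name per (fileName, indexNo) pair would serve equally well.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper (not part of the required IX interface): builds the
// name of the PF file that stores index number indexNo on data file fileName,
// following the "fileName.indexNo" suggestion.
std::string MakeIndexFileName(const char *fileName, int indexNo) {
    std::ostringstream name;
    name << fileName << '.' << indexNo;
    return name.str();
}
```

CreateIndex, DestroyIndex, and OpenIndex could all call such a helper so that the three methods always agree on the file name.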
RC DestroyIndex (const char *fileName, int indexNo)
This method should destroy the index numbered indexNo on the
data file named fileName by destroying the PF component file
used to store the index.
RC OpenIndex (const char *fileName, int indexNo, IX_IndexHandle &indexHandle)
This method should open the index numbered indexNo on the
data file named fileName by opening the PF component file
used to store the index. If the method is successful, the
indexHandle object should become a handle for the open index.
The index handle is used to insert entries into and delete entries from the
index (see the IX_IndexHandle methods below), and it can be
passed into an IX_IndexScan constructor (see below) for
performing a scan using the index. As with RM component files,
clients should be able to open an index more than once for reading
using a different indexHandle object each time. However, you
may make the assumption (without checking it) that if a client is
modifying an index, then no other clients are using an
indexHandle to read or modify that index.
RC CloseIndex (IX_IndexHandle &indexHandle)
This method should close the open index referred to by
indexHandle by closing the PF component file used to store
the index.
IX_IndexHandle Class
The IX_IndexHandle class is used to insert and delete index
entries, and to force pages of an index's file to disk. To perform
these operations, a client first creates an instance of this class and
passes it to the IX_Manager::OpenIndex method described
above.
class IX_IndexHandle {
  public:
       IX_IndexHandle  ();                             // Constructor
       ~IX_IndexHandle ();                             // Destructor
    RC InsertEntry     (void *pData, const RID &rid);  // Insert new index entry
    RC DeleteEntry     (void *pData, const RID &rid);  // Delete index entry
    RC ForcePages      ();                             // Copy index to disk
};
RC InsertEntry (void *pData, const RID &rid)
For this and the following two methods, it is an error if the
IX_IndexHandle object on which the method is called does not
refer to an open index. This method should insert a new entry into
the index associated with IX_IndexHandle. Parameter
pData points to the attribute value to be inserted into the
index, and parameter rid identifies the record with that
value to be added to the index. Hence, this method effectively
inserts an entry for the pair (*pData,rid) into the index.
(The index should contain only the record's RID, not the record
itself.) If the indexed attribute is a character string of length
n, then you may assume that *pData is exactly n
bytes long; similarly for parameter *pData in the next
method. This method should return a nonzero code if there is already
an entry for (*pData,rid) in the index.
RC DeleteEntry (void *pData, const RID &rid)
This method should delete the entry for the (*pData,rid) pair
from the index associated with IX_IndexHandle. Although
clients of the IX component typically will ensure that
DeleteEntry is not called for entries that are not in the
index, for debugging purposes you may want to return a (positive)
error code if such a call is made.
RC ForcePages ()
This method should copy to disk all pages associated with the
IX_IndexHandle. The index page contents are forced to disk
by calling PF_FileHandle::ForcePages for the index file.
IX_IndexScan Class
The IX_IndexScan class is used to perform condition-based scans
over the entries of an index.
class IX_IndexScan {
  public:
       IX_IndexScan  ();                                 // Constructor
       ~IX_IndexScan ();                                 // Destructor
    RC OpenScan      (const IX_IndexHandle &indexHandle, // Initialize index scan
                      CompOp      compOp,
                      void        *value,
                      ClientHint  pinHint = NO_HINT);
    RC GetNextEntry  (RID &rid);                         // Get next matching entry
    RC CloseScan     ();                                 // Terminate index scan
};
RC OpenScan (const IX_IndexHandle &indexHandle, CompOp compOp, void *value, ClientHint pinHint = NO_HINT)
This method should initialize a condition-based scan over the entries
in the open index referred to by parameter indexHandle. Once
underway, the scan should produce the RIDs of all records whose
indexed attribute value compares in the specified way with the
specified value. Parameters compOp and value are
exactly as in the RM_FileScan::OpenScan method (including the
possibility that compOp=NO_OP and value is a null
pointer, indicating a complete scan); please refer to the RM
Component document for details. The only exception is that for B+
tree scans, you may choose to disallow comparison operator
NE_OP (not-equal). You will need to cast parameter
value into the appropriate type for the attribute (or, in the
case of an integer or float, copy it into a separate variable to avoid
alignment problems), as in the RM component. Also as in method
IX_IndexHandle::InsertEntry, if the indexed attribute is a
character string of length n, then you may assume that
value is exactly n bytes long. As in RM component file
scans, optional parameter pinHint is included so that
higher-level RedBase components using an IX component index scan can
suggest a specific page-pinning strategy for the IX component to use
during the index scan, to achieve maximum efficiency. Exploiting this
parameter, either now or later, is entirely optional.
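The comparison step described above might be sketched as follows. The helper names CompareAttr and Satisfies are hypothetical, and the two enums are redefined here only to keep the sketch self-contained; they are assumed to mirror the definitions in redbase.h. Integer and float values are copied out with memcpy, which is one way to avoid the alignment problems mentioned above.

```cpp
#include <cstddef>
#include <cstring>

// Assumed to mirror redbase.h; redefined only so the sketch stands alone.
enum AttrType { INT, FLOAT, STRING };
enum CompOp { NO_OP, EQ_OP, NE_OP, LT_OP, GT_OP, LE_OP, GE_OP };

// Hypothetical helper: three-way comparison of two attribute values.
// Returns a negative, zero, or positive result, like strcmp.
int CompareAttr(const void *a, const void *b, AttrType type, int length) {
    switch (type) {
    case INT: {
        int x, y;  // copy out: the source bytes may be unaligned
        std::memcpy(&x, a, sizeof x);
        std::memcpy(&y, b, sizeof y);
        return (x < y) ? -1 : (x > y) ? 1 : 0;
    }
    case FLOAT: {
        float x, y;
        std::memcpy(&x, a, sizeof x);
        std::memcpy(&y, b, sizeof y);
        return (x < y) ? -1 : (x > y) ? 1 : 0;
    }
    case STRING:
        return std::strncmp(static_cast<const char *>(a),
                            static_cast<const char *>(b),
                            static_cast<std::size_t>(length));
    }
    return 0;
}

// Hypothetical helper: does an index entry's value satisfy "entry compOp value"?
bool Satisfies(const void *entry, CompOp op, const void *value,
               AttrType type, int length) {
    if (op == NO_OP || value == nullptr) return true;  // complete scan
    int c = CompareAttr(entry, value, type, length);
    switch (op) {
    case EQ_OP: return c == 0;
    case NE_OP: return c != 0;  // an implementation may instead reject NE_OP
    case LT_OP: return c < 0;
    case GT_OP: return c > 0;
    case LE_OP: return c <= 0;
    case GE_OP: return c >= 0;
    default:    return false;
    }
}
```

A B+ tree scan would normally use the same three-way comparison to choose a starting leaf, then apply the predicate to each entry until it can no longer be satisfied.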
RC GetNextEntry (RID &rid)
This method should set output parameter rid to be the RID of
the next record in the index scan. This method should return
IX_EOF (a positive return code that you define) if there are
no index entries left satisfying the scan condition. You may assume
that IX component clients will not close the corresponding open index
while a scan is underway.
RC CloseScan ()
This method should terminate the index scan.
IX_PrintError
void IX_PrintError (RC rc);
This routine should write a message associated with the nonzero IX
return code rc onto the Unix stderr output stream.
This routine has no return value.
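One possible shape for this routine is sketched below. The return code names and values other than IX_EOF are purely illustrative assumptions, as is the IX_ErrorMessage helper; RC is assumed to be the integer typedef from redbase.h.

```cpp
#include <iostream>

typedef int RC;  // assumed to match redbase.h

// Hypothetical return codes and values, chosen only for illustration.
const RC IX_EOF            = 201;
const RC IX_ENTRYNOTFOUND  = 202;
const RC IX_DUPLICATEENTRY = 203;
const RC IX_BUCKETFULL     = 204;

// Hypothetical helper: maps a return code to a message.
const char *IX_ErrorMessage(RC rc) {
    switch (rc) {
    case IX_EOF:            return "end of index scan";
    case IX_ENTRYNOTFOUND:  return "entry not found in index";
    case IX_DUPLICATEENTRY: return "entry already in index";
    case IX_BUCKETFULL:     return "too many RIDs for one attribute value";
    default:                return "unrecognized IX return code";
    }
}

// Writes the message for a nonzero IX return code to stderr.
void IX_PrintError(RC rc) {
    std::cerr << "IX: " << IX_ErrorMessage(rc) << " (rc=" << rc << ")\n";
}
```

Keeping the message lookup separate from the printing makes the mapping easy to check in a test without capturing stderr.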
Implementation Details
- As with Part 1, you are free to use alternative design ideas to
those suggested here if you believe your ideas will improve the
structure or performance of your code. The only thing that you must
not alter is the interface itself, although you are free to extend it
either now or as the project progresses.
- Each node of a B+ tree can be stored in one page of the
corresponding PF file. Logically, index entries have the form
(attrValue,rid), indicating that in the data file being
indexed there is a record with RID rid whose value for the
indexed attribute is attrValue. In one straightforward
design the leaves of the tree are structurally identical to internal
nodes, i.e., they contain attribute values and page pointers
(numbers), and there is a separate "bucket" page (pointed to from the
leaf nodes) for each attribute value, containing the list of RIDs for
that value. Note that although this straightforward design is
certainly acceptable for your project, it is very inefficient when the
number of RIDs for each attribute value is low -- buckets will be
nearly empty. There are numerous improvements or alternatives to this
straightforward design.
- Simplification: You do not need to handle the case
where the number of RIDs for a single attribute value is so large that
all RIDs for that value cannot be stored on one page, but you should
detect if such an overflow occurs and generate a nonzero return
code.
- Small amount of extra credit: Accommodate any number of
RIDs for an attribute value, typically through bucket chaining.
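The bucket-page design and its overflow check could be sketched roughly as below. Everything here is an assumption made for illustration: the page size constant (taken to be the usable bytes in one PF page), the header layout, the RidEntry stand-in for RID, and the IX_BUCKETFULL return code.

```cpp
#include <cstddef>
#include <cstring>

typedef int RC;       // assumed to match redbase.h
typedef int PageNum;  // assumed to match the PF component
typedef int SlotNum;  // assumed to match the RM component

const int kPageSize = 4092;    // assumption: usable bytes in one PF page
const RC IX_BUCKETFULL = 204;  // hypothetical return code

struct RidEntry { PageNum pageNum; SlotNum slotNum; };  // stand-in for RID
struct BucketHeader { int numRids; };                   // hypothetical layout

// Number of RIDs that fit on one bucket page after the header.
int BucketCapacity() {
    return static_cast<int>((kPageSize - sizeof(BucketHeader)) / sizeof(RidEntry));
}

// Appends rid to the bucket stored in the raw page buffer, or reports
// overflow with a nonzero return code when the page is already full.
RC BucketAppend(char *page, const RidEntry &rid) {
    BucketHeader hdr;
    std::memcpy(&hdr, page, sizeof hdr);
    if (hdr.numRids >= BucketCapacity()) return IX_BUCKETFULL;
    std::memcpy(page + sizeof hdr + hdr.numRids * sizeof rid, &rid, sizeof rid);
    ++hdr.numRids;
    std::memcpy(page, &hdr, sizeof hdr);
    return 0;
}
```

Under the extra-credit variant, the header would presumably also carry the page number of the next bucket in the chain, and a full bucket would trigger allocation of a new page instead of an error.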
- The three fundamental B+ tree operations -- search
(which extends to scan), insertion, and deletion -- vary
quite a bit in their implementation complexity. We suggest that you
get search and insertion running first, and then worry about deletion.
In fact, implementing a completely correct delete operation in B+
trees turns out to be quite difficult.
- Simplification: You may implement lazy deletion. In
this approach, when an entry is deleted, even if it causes a leaf page
to become less than half full no redistribution or node merging takes
place -- the underfull page remains in the tree. When a leaf page
becomes empty the node is removed from the tree and an entry is
removed from its parent, but again no redistribution or node merging
takes place.
- Small amount of credit deducted: Even simpler than lazy
deletion is tombstones. In this approach, when an entry is
deleted it is replaced by a special marker indicating an empty slot
(which may be reused later). Tree nodes are never deleted or merged,
although empty buckets should be deleted. You cannot receive full
credit on the IX component if you implement tombstones, but it is an
option to consider if you're pressed for time.
- Medium amount of extra credit: Implementing fully correct
(i.e., rebalancing) deletion is quite complex, with a number of tricky
end cases. We will be duly impressed if you do so.
Regardless of which approach you use, deletion must work: once an IX
component client asks for an entry to be deleted, that entry should
never appear in a subsequent index scan.
- We strongly suggest that you implement your B+ tree operations
using recursive algorithms. Although you may find it somewhat
difficult to understand these algorithms at first, using a recursive
approach will greatly simplify your coding task.
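To illustrate the recursive style, here is a small in-memory sketch of B+ tree insertion. It is not a RedBase implementation: nodes live on the heap instead of in PF pages, keys are distinct integers with no RIDs or buckets, and the fan-out (kMaxKeys) is tiny. All names are hypothetical. The point is the shape of the recursion: each call inserts below its node and tells its caller whether the node split.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

const std::size_t kMaxKeys = 4;  // tiny fan-out, for illustration only

struct Node {
    bool leaf;
    std::vector<int> keys;
    std::vector<Node *> kids;  // internal nodes only: keys.size() + 1 children
    explicit Node(bool isLeaf) : leaf(isLeaf) {}
};

// Index of the child (or leaf slot) that follows all keys <= key.
static std::size_t Position(const Node *n, int key) {
    return static_cast<std::size_t>(
        std::upper_bound(n->keys.begin(), n->keys.end(), key) - n->keys.begin());
}

// Moves the upper half of n into a new right sibling; sets sepKey to the
// key the parent should use to separate n from that sibling.
static Node *Split(Node *n, int &sepKey) {
    Node *right = new Node(n->leaf);
    std::size_t mid = n->keys.size() / 2;
    sepKey = n->keys[mid];
    if (n->leaf) {  // leaf: separator is copied up and stays in the right leaf
        right->keys.assign(n->keys.begin() + mid, n->keys.end());
    } else {        // internal: separator moves up and leaves this level
        right->keys.assign(n->keys.begin() + mid + 1, n->keys.end());
        right->kids.assign(n->kids.begin() + mid + 1, n->kids.end());
        n->kids.resize(mid + 1);
    }
    n->keys.resize(mid);
    return right;
}

// Inserts key below n. Returns the new right sibling if n split, else nullptr.
static Node *InsertRec(Node *n, int key, int &sepKey) {
    std::size_t pos = Position(n, key);
    if (n->leaf) {
        n->keys.insert(n->keys.begin() + pos, key);
    } else {
        int childSep = 0;
        Node *newChild = InsertRec(n->kids[pos], key, childSep);
        if (newChild == nullptr) return nullptr;  // child absorbed the key
        n->keys.insert(n->keys.begin() + pos, childSep);
        n->kids.insert(n->kids.begin() + pos + 1, newChild);
    }
    return n->keys.size() > kMaxKeys ? Split(n, sepKey) : nullptr;
}

// Top-level insert: grows the tree by one level when the root splits.
void Insert(Node *&root, int key) {
    int sep = 0;
    Node *right = InsertRec(root, key, sep);
    if (right != nullptr) {
        Node *newRoot = new Node(false);
        newRoot->keys.push_back(sep);
        newRoot->kids.push_back(root);
        newRoot->kids.push_back(right);
        root = newRoot;
    }
}

// Search: descend to the leaf that would hold key, then look for it there.
bool Contains(const Node *n, int key) {
    while (!n->leaf) n = n->kids[Position(n, key)];
    return std::binary_search(n->keys.begin(), n->keys.end(), key);
}

void FreeTree(Node *n) {
    for (Node *kid : n->kids) FreeTree(kid);
    delete n;
}
```

In a paged version, each Node pointer would presumably become a page number, and each visit would pin the page on the way down and unpin it on the way back up.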
- Index scans will be used by higher-level components when
executing selection, join, and delete operations, as well as update
operations on attributes other than the indexed attribute. Thus, each
index entry scanned will either be used to fetch (and possibly update)
a record, or to delete a record. (The insert operation
inserts one record at a time.) While making an index scan work
correctly for selection, join, and non-index-key update operations is
relatively straightforward, deletion operations are more complicated,
even when using the simplified approaches to deletion described above.
You must ensure that it is possible to use an index scan to find and
then delete all index entries satisfying a condition. That is, the
following client code segment should work:
    IX_IndexScan scan;
    scan.OpenScan(indexHandle, ...);
    while ((rc = scan.GetNextEntry(rid)) != IX_EOF) {
        // error checking
        // delete record
        // attrValue is value of indexed attribute
        indexHandle.DeleteEntry(attrValue, rid);
    }
Depending on your design, which simplifications you make, and how you
manage deletions, making sure this code will work may require a
varying amount of effort. However, it will enable your query
processor (Part 4) to efficiently perform delete operations
based on indexed attributes.
You may assume that during a retrieval scan (for selection, join,
and update operations) no index entries will be inserted or deleted.
You also may assume that during a deletion scan, no other index
records will be inserted or deleted, and no retrieval scans will be
underway. In the basic RedBase system, IX clients will ensure that
these types of conflicting operations/scans can never occur.
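One way to make a scan tolerate deletion of the entry it just returned is for the scan to remember that entry's (value, RID) pair, not a page and slot position, and to re-locate its successor on the next call. The toy sketch below shows only that idea: a std::set stands in for the B+ tree leaf level, an int stands in for a RID, and the class names and the IX_EOF value are hypothetical. It is one technique among several, not a prescribed design.

```cpp
#include <climits>
#include <set>
#include <utility>

typedef int RC;         // assumed to match redbase.h
const RC IX_EOF = 201;  // hypothetical value

typedef std::pair<int, int> Entry;  // (attribute value, stand-in for a RID)

// Toy index: an ordered set stands in for the B+ tree leaf level.
struct ToyIndex {
    std::set<Entry> entries;
    RC InsertEntry(int key, int rid) {
        return entries.insert(Entry(key, rid)).second ? 0 : 1;
    }
    RC DeleteEntry(int key, int rid) {
        return entries.erase(Entry(key, rid)) == 1 ? 0 : 1;
    }
};

// Toy scan for the condition "value >= low".
class ToyScan {
  public:
    ToyScan(const ToyIndex &index, int low)
        : index_(index), low_(low), started_(false) {}

    RC GetNextEntry(int &key, int &rid) {
        // Re-locate from the last entry returned; this still works if the
        // client has deleted that entry since the previous call.
        std::set<Entry>::const_iterator it =
            started_ ? index_.entries.upper_bound(last_)
                     : index_.entries.lower_bound(Entry(low_, INT_MIN));
        if (it == index_.entries.end()) return IX_EOF;
        last_ = *it;
        started_ = true;
        key = it->first;
        rid = it->second;
        return 0;
    }

  private:
    const ToyIndex &index_;
    int low_;
    bool started_;
    Entry last_;
};
```

The cost of this approach in a real B+ tree is a fresh descent from the root on each call, so a design that instead adjusts a saved leaf position after a deletion may be more I/O efficient.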
- Depending on your design, it may require some extra effort for
scans to always return RIDs such that the corresponding attribute
values are in increasing (actually nondecreasing) order. This property
is not required in the project; however, you may find it useful
later on, e.g., if you decide to use index scans to produce a sorted
relation. Note that one of the provided tests does exploit this
property, but passing this test is not required.
- As with the RM component, for the IX component you may find it
convenient to use an internal header file (ix_internal.h,
say), along with the external header file ix.h and the global
header file redbase.h.
- Return codes and error handling should continue in the style
you adopted for the RM component.
- You are again expected to include comments in your code, and
to submit a 1-2 page description (in a plain text file
ix_DOC) covering your design, key data structures, testing
strategy, and known bugs, and citing any assistance you received in
design or debugging. Since the specification and implementation
suggestions are somewhat looser for this component than they were for
Part 1, please be sure to explain any high-level algorithmic decisions
you needed to make.
- As in the RM component, after you submit your project part you
will receive an email message from the TA asking you to describe
your design decisions and implementation strategy for specific parts
of the IX component. You will be expected to respond within 48 hours
with a short message and pointers to the relevant code.
- Running the setup script for this component copies an
ix_test.cc program, but as always the tests are not
comprehensive and you will need to create more tests in order to
thoroughly exercise your code. We will use the same testing shell
with our (surprise) tests in order to check your component for
correctness. IX component tests from students in previous years are
in the test repository /usr/class/cs346/redbase/testers/.
Note that some of these tests were designed for hash indexes, which
only support the equality comparison operator, so the interface for
IX_IndexScan::OpenScan is somewhat different and those tests
would need to be modified for this year's project.
- If you are aiming to win the RedBase Efficiency Contest then it
is again very important to consider (and measure) I/O efficiency as
you develop your code.
- Part 2 should be submitted electronically in the same way you
submitted Part 1. Please compile using the -DPF_STATS flag,
remember to run the "submit -c" script with argument
"2" (for project part 2), and be sure to check the file
submit.ix before issuing the final "submit -s 2"
command. Similar to Part 1, the TA test program will link with
libix.a, which should be the result of running "make" from within
your submission directory, and everything must link correctly if the
test program calls methods in ix.h.