CS346 - Spring 2011
Database System Implementation

RedBase: The Paged File Component

Introduction
We will provide code for the "bottom" component of the RedBase system, the Paged File (PF) component. This component provides facilities for higher-level client components to perform file I/O in terms of pages. In the PF component, methods are provided to create, destroy, open, and close paged files, to scan through the pages of a given file, to read a specific page of a given file, to add and delete pages of a given file, and to obtain and release pages for scratch use. To get started using the PF component, consult the RedBase Logistics document.

The C++ interface for the PF component is provided below. The name of each class begins with the prefix PF -- you will follow a similar naming convention for your components of the system. Each method in the PF component except constructors and destructors returns an integer code; the same will be true of all of the methods you will write. A return code of 0 indicates normal completion. A nonzero return code indicates that an exception condition or error has occurred. Positive nonzero return codes indicate non-error exception conditions (such as reaching the end of a file) or errors from which the system can recover or exit gracefully (such as trying to close an unopened file). Negative nonzero return codes indicate errors from which the system cannot recover. PF return codes and error handling are described below.
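The sign convention above can be captured in a few small helpers. This is only a sketch: the typedef mirrors the description of return codes as integers, and the names IsOk, IsRecoverable, IsFatal, and CanContinue are hypothetical and are not part of the PF component.

```cpp
#include <cassert>  // for assert-based checks by callers of these helpers

typedef int RC;  // assumption: a return code is a plain integer

// Hypothetical helpers (not part of PF) that encode the sign convention.
inline bool IsOk(RC rc)          { return rc == 0; }  // normal completion
inline bool IsRecoverable(RC rc) { return rc > 0; }   // exception condition or recoverable error
inline bool IsFatal(RC rc)       { return rc < 0; }   // error the system cannot recover from

// A caller may keep running after any code that is not negative.
inline bool CanContinue(RC rc)   { return !IsFatal(rc); }
```

A client could use CanContinue to decide between cleaning up and continuing, or giving up entirely.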

The Buffer Pool of Pages
Accessing data on a page of a file requires first reading the page into a buffer pool in main memory, then manipulating (reading or writing) its data there. While a page is in memory and its data is available for manipulation, the page is said to be "pinned" in the buffer pool. A pinned page remains in the buffer pool until it is explicitly "unpinned." A client unpins a page when it is done manipulating the data on that page. Unpinning a page does not necessarily cause the page to be removed from the buffer -- an unpinned page is kept in memory as long as its space in the buffer pool is not needed.

If the PF component needs to read a new page into memory and there are no free spaces left in the buffer pool, then the PF component will choose an unpinned page to remove from the buffer pool and will reuse its space. The PF component uses a Least-Recently-Used (LRU) page replacement policy. When a page is removed from the buffer pool, it is copied back to the file on disk if and only if the page is marked as "dirty." Dirty pages are not written to disk automatically until they are removed from the buffer. However, a PF client can always send an explicit request to force (i.e., write to disk) the contents of a particular page, or to force all dirty pages of a file, without removing those pages from the buffer.
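The replacement policy can be illustrated with a toy model. The sketch below is not the PF buffer manager: the class ToyBufferPool, its members, and the writes log are all invented for illustration, and recency is updated only when a page is pinned, which is a simplifying assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <vector>

// Toy model of the policy described above. This is NOT the PF buffer manager.
struct Frame {
    int  pageNum;
    int  pinCount;
    bool dirty;
};

class ToyBufferPool {
  public:
    explicit ToyBufferPool(std::size_t capacity) : capacity_(capacity) {}

    // Pin a page, loading it if needed. Returns false when every frame is pinned.
    bool Pin(int pageNum) {
        for (std::list<Frame>::iterator it = frames_.begin(); it != frames_.end(); ++it) {
            if (it->pageNum == pageNum) {
                ++it->pinCount;
                frames_.splice(frames_.begin(), frames_, it);  // now most recently used
                return true;
            }
        }
        if (frames_.size() >= capacity_ && !EvictOne()) return false;
        Frame f = { pageNum, 1, false };
        frames_.push_front(f);
        return true;
    }

    void MarkDirty(int pageNum) {
        Frame *f = Find(pageNum);
        assert(f != 0 && f->pinCount > 0);  // the page must be pinned
        f->dirty = true;
    }

    void Unpin(int pageNum) {
        Frame *f = Find(pageNum);
        assert(f != 0 && f->pinCount > 0);
        --f->pinCount;
    }

    std::vector<int> writes;  // pages written back to "disk", in eviction order

  private:
    Frame *Find(int pageNum) {
        for (std::list<Frame>::iterator it = frames_.begin(); it != frames_.end(); ++it)
            if (it->pageNum == pageNum) return &*it;
        return 0;
    }

    // Remove the least recently used unpinned frame; write it back only if dirty.
    bool EvictOne() {
        std::list<Frame>::iterator it = frames_.end();
        while (it != frames_.begin()) {
            --it;
            if (it->pinCount == 0) {
                if (it->dirty) writes.push_back(it->pageNum);
                frames_.erase(it);
                return true;
            }
        }
        return false;
    }

    std::size_t capacity_;
    std::list<Frame> frames_;
};
```

With a capacity of two frames, pinning a third page evicts the oldest unpinned page, and that page shows up in writes only if it was marked dirty. When every frame is pinned, Pin fails, which corresponds to the out-of-buffer-space condition.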

It is important not to leave pages pinned in memory unnecessarily. The PF component clients that you will implement can be designed so that each operation assumes none of the pages it needs are in the buffer pool: A client fetches the pages it needs, performs the appropriate actions on them, and then unpins them, even if it thinks a certain page may be needed again in the near future. (If the page is used again soon then it will probably still be in the buffer pool anyway.) The PF component does allow the same page to be pinned more than once, without unpinning it in between. In this case, the page won't actually be unpinned until the number of unpin operations matches the number of pin operations. It is very important that each time you fetch and pin a page, you don't forget to unpin it when you're done. If you fail to unpin pages, the buffer pool will slowly fill up until you can no longer fetch any pages at all (at which point the PF component will return a negative code).
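One way to make a forgotten unpin less likely is to tie the unpin to the lifetime of a local object. The following is a sketch under stated assumptions: PinGuard is a hypothetical helper, not something the PF component provides, and FakeFileHandle is a stand-in that only counts pins (its PinPage method plays the role of the PF methods that fetch and pin a page).

```cpp
#include <cassert>
#include <map>

typedef int RC;
typedef int PageNum;

// Stand-in for a file handle that only tracks pin counts (assumption for this sketch).
class FakeFileHandle {
  public:
    RC PinPage(PageNum p) { ++pins_[p]; return 0; }
    RC UnpinPage(PageNum p) {
        if (pins_[p] == 0) return 1;  // positive code: page was not pinned
        --pins_[p];
        return 0;
    }
    int PinCount(PageNum p) const {
        std::map<PageNum, int>::const_iterator it = pins_.find(p);
        return it == pins_.end() ? 0 : it->second;
    }
  private:
    std::map<PageNum, int> pins_;
};

// Hypothetical helper: releases exactly one pin when it goes out of scope.
template <class FileHandle>
class PinGuard {
  public:
    PinGuard(FileHandle &fh, PageNum p) : fh_(fh), page_(p) {}
    ~PinGuard() { fh_.UnpinPage(page_); }  // return code ignored in this sketch
  private:
    PinGuard(const PinGuard &);             // non-copyable
    PinGuard &operator=(const PinGuard &);
    FileHandle &fh_;
    PageNum page_;
};

// Pins the same page twice; each guard releases one of the two pins.
inline int PinsLeftAfterNestedUse(FakeFileHandle &fh, PageNum p) {
    fh.PinPage(p);
    PinGuard<FakeFileHandle> outer(fh, p);
    {
        fh.PinPage(p);                          // second pin of the same page
        PinGuard<FakeFileHandle> inner(fh, p);
    }                                           // inner guard unpins once here
    return fh.PinCount(p);                      // still pinned once by outer
}
```

The page stays pinned until the number of unpins matches the number of pins: inside the function one pin remains, and after the function returns the count is back to zero.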

Page Numbers
Pages in a file are identified by page numbers, which correspond to their location within the file on disk. When you initially create a file and allocate pages, page numbering will be sequential. However, once pages have been deleted, the numbers of newly allocated pages are not sequential. The PF component reallocates previously allocated pages using a LIFO (stack) algorithm -- that is, it reallocates the most recently deleted (and not yet reallocated) page. A brand new page is never allocated if a previously allocated page is available.
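The numbering policy can be mimicked with a stack of freed page numbers. This is a toy sketch, not PF code: ToyPageAllocator is a made-up class, and starting the numbering at 0 is an assumption.

```cpp
#include <cassert>
#include <vector>

typedef int PageNum;

// Toy allocator that mimics the numbering policy described above; not PF code.
class ToyPageAllocator {
  public:
    ToyPageAllocator() : nextNew_(0) {}  // assumption: numbering starts at 0

    PageNum Allocate() {
        if (!freed_.empty()) {           // reuse a freed page before growing the file
            PageNum p = freed_.back();   // most recently deleted page
            freed_.pop_back();
            return p;
        }
        return nextNew_++;               // brand new page, next sequential number
    }

    void Dispose(PageNum p) { freed_.push_back(p); }

  private:
    PageNum nextNew_;
    std::vector<PageNum> freed_;         // used as a stack (LIFO)
};
```

For example, after allocating pages 0 through 3 and then disposing of pages 1 and 3 in that order, the next three allocations yield 3, then 1, then the brand new page 4.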

When you scan through a file by calling the GetFirstPage and GetNextPage methods (described below), you will obtain pages in their numeric order, skipping those pages that were allocated and then deleted, and ending the scan with the largest page number currently valid. Since numeric scan order is guaranteed, and because initial page numbering is sequential, it is possible for clients to implement a policy where the first one or more pages of each file are used for header information.

Page Deallocation
Although the PF component itself deallocates PF file pages, it doesn't "give back" these pages to the underlying Unix file system, because most Unix systems do not have the capability to collapse files and make use of the empty pages. However, you should write your PF clients under the assumption that file collapsing could occur. That is, your code should not need to change if the PF component were modified to do actual file compression after page disposal.

Scratch Pages
Most RedBase implementations store and manipulate all of their data on pages associated with files. Occasionally, students may wish to implement more sophisticated and efficient algorithms that require storing and manipulating pages of data temporarily in "scratch" memory. In a realistic database system setting, scratch memory competes with file pages for buffer pool space, and the same constraints are a requirement of the RedBase I/O Efficiency Contest. Therefore, the PF component includes methods for allocating and disposing of scratch pages (memory blocks) in the buffer pool. These blocks reside in the buffer pool and are handled by the buffer manager, but they are not associated with a particular file.

Most students will not make use of these methods.

PF Interface
The PF interface consists of three classes: the PF_Manager class, the PF_FileHandle class, and the PF_PageHandle class. In addition, there is a PF_PrintError routine for printing messages associated with nonzero PF return codes.

*** PF_Manager Class ***

The PF_Manager class handles the creation, deletion, opening, and closing of paged files, along with the allocation and disposal of scratch pages. Your program should create exactly one instance of this class, and all requests for PF component file management should be directed to that instance. Below, the public methods of the class declaration are shown first, followed by descriptions of the methods. The first two methods in the class declaration are the constructor and destructor methods for the class; they are not explained further. Each method except the constructor and destructor methods returns a value of type RC (for "return code" -- actually an integer). A return code of 0 indicates normal completion. A nonzero return code indicates that an exception condition or error has occurred.
class PF_Manager
{
  public:
       PF_Manager    ();                           // Constructor
       ~PF_Manager   ();                           // Destructor
    RC CreateFile    (const char *fileName);       // Create a new file
    RC DestroyFile   (const char *fileName);       // Destroy a file
    RC OpenFile      (const char *fileName, PF_FileHandle &fileHandle);  
                                                   // Open a file
    RC CloseFile     (PF_FileHandle &fileHandle);  // Close a file
    RC AllocateBlock (char *&buffer);              // Allocate a new scratch page in buffer
    RC DisposeBlock  (char *buffer);               // Dispose of a scratch page
};

RC CreateFile (const char *fileName)

This method creates a paged file called fileName. The file should not already exist.

RC DestroyFile (const char *fileName)

This method destroys the paged file whose name is fileName. The file should exist.

RC OpenFile (const char *fileName, PF_FileHandle &fileHandle)

This method opens the paged file whose name is fileName. The file must already exist and it must have been created using the CreateFile method. If the method is successful, the fileHandle object whose address is passed as a parameter becomes a "handle" for the open file. The file handle is used to manipulate the pages of the file (see the PF_FileHandle class description below). It is a (positive) error if fileHandle is already a handle for an open file when it is passed to the OpenFile method. It is not an error to open the same file more than once if desired, using a different fileHandle object each time. Each call to the OpenFile method creates a new "instance" of the open file. Warning: Opening a file more than once for data modification is not prevented by the PF component, but doing so is likely to corrupt the file structure and may crash the PF component. Opening a file more than once for reading is no problem.

RC CloseFile (PF_FileHandle &fileHandle)

This method closes the open file instance referred to by fileHandle. The file must have been opened using the OpenFile method. All of the file's pages are flushed from the buffer pool when the file is closed. It is a (positive) error to attempt to close a file when any of its pages are still pinned in the buffer pool.

RC AllocateBlock (char *&buffer)

This method allocates a "scratch" memory page (block) in the buffer pool and sets buffer to point to it. The amount of memory available in the block is PF_PAGE_SIZE + 4 = 4096 bytes. The scratch page is automatically pinned in the buffer pool.

RC DisposeBlock (char *buffer)

This method disposes of the scratch page in the buffer pool pointed to by buffer, which must have been allocated previously by PF_Manager::AllocateBlock. Similar to pinning and unpinning, you must call PF_Manager::DisposeBlock for each buffer block obtained by calling PF_Manager::AllocateBlock; otherwise you will lose pages in the buffer pool permanently.

*** PF_FileHandle Class ***

The PF_FileHandle class provides access to the pages of an open file. To access the pages of a file, a client first creates an instance of this class and passes it to the PF_Manager::OpenFile method described above. As before, the public methods of the class declaration are shown first, followed by descriptions of the methods. The first two methods in the class declaration are the constructor and destructor methods and are not explained further.
class PF_FileHandle {
  public:
       PF_FileHandle  ();                                  // Default constructor
       ~PF_FileHandle ();                                  // Destructor
       PF_FileHandle  (const PF_FileHandle &fileHandle);   // Copy constructor
       PF_FileHandle& operator= (const PF_FileHandle &fileHandle);
                                                           // Overload =
    RC GetFirstPage   (PF_PageHandle &pageHandle) const;   // Get the first page
    RC GetLastPage    (PF_PageHandle &pageHandle) const;   // Get the last page
    
    RC GetNextPage    (PageNum current, PF_PageHandle &pageHandle) const; 
                                                           // Get the next page
    RC GetPrevPage    (PageNum current, PF_PageHandle &pageHandle) const;
                                                           // Get the previous page
    RC GetThisPage    (PageNum pageNum, PF_PageHandle &pageHandle) const;  
                                                           // Get a specific page
    RC AllocatePage   (PF_PageHandle &pageHandle);         // Allocate a new page
    RC DisposePage    (PageNum pageNum);                   // Dispose of a page 
    RC MarkDirty      (PageNum pageNum) const;             // Mark a page as dirty
    RC UnpinPage      (PageNum pageNum) const;             // Unpin a page
    RC ForcePages     (PageNum pageNum = ALL_PAGES) const; // Write dirty page(s)
                                                           //   to disk
 };
Note: The first two methods described below -- the copy constructor and the overloaded = operator -- are somewhat advanced C++ concepts. In general, it is not required to use or implement these methods in RedBase classes, but you are free to do so if you wish. These methods are provided in the PF_FileHandle and PF_PageHandle classes of the PF component for those students who may wish to construct one object from another or assign one object to another.

PF_FileHandle (const PF_FileHandle &fileHandle)

This method is the copy constructor, called if a new file handle object is created from an existing one. When a new file handle object is created from a file handle object that refers to an open file instance, the file is not opened an additional time. Instead, both file handle objects refer to the same open file instance. It is sufficient to call PF_Manager::CloseFile with one of the file handle objects to close the file.

PF_FileHandle& operator= (const PF_FileHandle &fileHandle)

This method overloads the = operator when it is used to assign one file handle object to another. It is not a good idea to assign one file handle object to another if the file handle object on the left-hand side of the = already refers to an open file. As with the copy constructor, if the file handle object on the right-hand side of the = refers to an open file instance, the file is not opened an additional time. Instead, both file handle objects refer to the same open file instance, and it is sufficient to call PF_Manager::CloseFile with one of the file handle objects to close the file.

RC GetFirstPage (PF_PageHandle &pageHandle) const

For this and the following methods, it is a (positive) error if the PF_FileHandle object for which the method is called does not refer to an open file. This method reads the first page of the file into the buffer pool in memory. If the page fetch is successful, the pageHandle object becomes a handle for the page. The page handle is used to access the page's contents (see the PF_PageHandle class description below). The page read is automatically pinned in the buffer pool and remains pinned until it is explicitly unpinned by calling the UnpinPage method (below). This method returns the positive code PF_EOF if end-of-file is reached (meaning there is no first page).

RC GetLastPage (PF_PageHandle &pageHandle) const

This method reads the last page of the file into the buffer pool in memory. If the page fetch is successful, the pageHandle object becomes a handle for the page. The page read is automatically pinned in the buffer pool and remains pinned until it is explicitly unpinned by calling the UnpinPage method (below). This method returns the positive code PF_EOF if end-of-file is reached (meaning there is no last page).

RC GetNextPage (PageNum current, PF_PageHandle &pageHandle) const

This method reads into memory the next valid page after the page whose number is current. If the page fetch is successful, pageHandle becomes a handle for the page. The page read is pinned in the buffer pool until it is unpinned by calling the UnpinPage method. This method returns PF_EOF if end-of-file is reached (meaning there is no next page). Note that it is not an error if current does not correspond to a valid page (e.g., if the page numbered current has been disposed of).
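A forward scan built on GetFirstPage and GetNextPage might look like the sketch below. To keep the example self-contained, MockFileHandle and MockPageHandle are in-memory stand-ins written for this illustration, the value given to PF_EOF is an assumption (the real constant is defined in pf.h), and ScanPages is a hypothetical client function. Error paths are simplified.

```cpp
#include <cassert>
#include <set>
#include <vector>

typedef int RC;
typedef int PageNum;
const RC PF_EOF = 1;  // assumption: the real value comes from pf.h

// In-memory stand-ins for the PF handle classes (assumptions for this sketch).
class MockPageHandle {
  public:
    MockPageHandle() : pageNum_(-1) {}
    RC GetPageNum(PageNum &pageNum) const { pageNum = pageNum_; return 0; }
    void Set(PageNum p) { pageNum_ = p; }
  private:
    PageNum pageNum_;
};

class MockFileHandle {
  public:
    explicit MockFileHandle(const std::set<PageNum> &valid) : valid_(valid), pinned_(0) {}
    RC GetFirstPage(MockPageHandle &ph) const { return GetNextPage(-1, ph); }
    RC GetNextPage(PageNum current, MockPageHandle &ph) const {
        std::set<PageNum>::const_iterator it = valid_.upper_bound(current);
        if (it == valid_.end()) return PF_EOF;  // no valid page after current
        ph.Set(*it);
        ++pinned_;                              // a fetched page is pinned
        return 0;
    }
    RC UnpinPage(PageNum) const { --pinned_; return 0; }
    int Pinned() const { return pinned_; }
  private:
    std::set<PageNum> valid_;
    mutable int pinned_;
};

// Visit every valid page in numeric order, unpinning each page before moving on.
template <class FileHandle, class PageHandle>
RC ScanPages(const FileHandle &fh, std::vector<PageNum> &seen) {
    PageHandle ph;
    RC rc = fh.GetFirstPage(ph);
    while (rc == 0) {
        PageNum p;
        if ((rc = ph.GetPageNum(p))) return rc;
        seen.push_back(p);
        if ((rc = fh.UnpinPage(p))) return rc;
        rc = fh.GetNextPage(p, ph);
    }
    return rc == PF_EOF ? 0 : rc;  // end-of-file is the normal way to finish
}
```

The loop treats PF_EOF as normal termination rather than as a failure, and it leaves no page pinned once the scan completes.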

RC GetPrevPage (PageNum current, PF_PageHandle &pageHandle) const

This method reads into memory the valid page previous to the page whose number is current. If the page fetch is successful, pageHandle becomes a handle for the page. The page read is pinned in the buffer pool until it is unpinned by calling the UnpinPage method. This method returns PF_EOF if end-of-file is reached (meaning there is no previous page). Note that it is not an error if current does not correspond to a valid page (e.g., if the page numbered current has been disposed of).

RC GetThisPage (PageNum pageNum, PF_PageHandle &pageHandle) const

This method reads into memory the page specified by pageNum. If the page fetch is successful, pageHandle becomes a handle for the page. Parameter pageNum must be a valid page number. As usual, the page read is pinned in the buffer pool until it is explicitly unpinned.

RC AllocatePage (PF_PageHandle &pageHandle)

This method allocates a new page in the file, reads the new page into memory, and pins the new page in the buffer pool. If successful, pageHandle becomes a handle for the new page.

RC DisposePage (PageNum pageNum)

This method disposes of the page specified by pageNum. After this method is executed, if you scan over the pages of the file, the page numbered pageNum will no longer appear. It is a (positive) error to attempt to dispose of a page that is pinned in the buffer pool.

RC MarkDirty (PageNum pageNum) const

This method marks the page specified by pageNum as "dirty," indicating that the contents of the page have been or will be modified. The page must be pinned in the buffer pool. A page marked as dirty is written back to disk when the page is removed from the buffer pool. (Pages not marked as dirty are never written back to disk.)

RC UnpinPage (PageNum pageNum) const

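This method tells the PF component that the page specified by pageNum is no longer needed in memory. If the page has been pinned more than once, it remains pinned until the number of UnpinPage calls matches the number of pin operations.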

RC ForcePages (PageNum pageNum = ALL_PAGES) const

This method copies the contents of the page specified by pageNum from the buffer pool to disk if the page is in the buffer pool and is marked as dirty. The page remains in the buffer pool but is no longer marked as dirty. If no specific page number is provided (i.e., pageNum = ALL_PAGES), then all dirty pages of this file that are in the buffer pool are copied to disk and are no longer marked as dirty. Note that page contents are copied to disk whether or not a page is pinned.

*** PF_PageHandle Class ***

The PF_PageHandle class provides access to the contents of a given page. To access the contents of a page, a client first creates an instance of this class and passes it to one of the PF_FileHandle methods described above.
class PF_PageHandle {
  public:
       PF_PageHandle  ();                          // Default constructor
       ~PF_PageHandle ();                          // Destructor
       PF_PageHandle  (const PF_PageHandle &pageHandle); 
                                                   // Copy constructor
       PF_PageHandle& operator= (const PF_PageHandle &pageHandle);
                                                   // Overload =
    RC GetData        (char *&pData) const;        // Set pData to point to
                                                   //   the page contents
    RC GetPageNum     (PageNum &pageNum) const;    // Return the page number
 };

PF_PageHandle (const PF_PageHandle &pageHandle)

This method is the copy constructor. When a new page handle object is created from a page handle object that refers to a pinned page in the buffer pool, the page is not pinned a second time.

PF_PageHandle& operator= (const PF_PageHandle &pageHandle)

This method overloads the = operator when it is used to assign one page handle object to another. As with the copy constructor, if the page handle object on the right-hand side of the = refers to a pinned page, the page is not pinned a second time.

RC GetData (char *&pData) const

This method provides access to the actual contents of a page. The PF_PageHandle object for which this method is called must refer to a page that is pinned in the buffer pool. If the method is successful, pData is set to point to the contents of the page in the buffer pool.
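Once pData points at the page contents, a client typically copies its own structures in and out of that buffer. The sketch below shows one possible approach with a plain char array standing in for the pinned page. The FileHeader layout and the WriteHeader and ReadHeader helpers are hypothetical, and the value 4092 for PF_PAGE_SIZE is an assumption inferred from "PF_PAGE_SIZE + 4 = 4096" (the real constant comes from pf.h).

```cpp
#include <cassert>
#include <cstring>

const int PF_PAGE_SIZE = 4092;  // assumption: inferred from "PF_PAGE_SIZE + 4 = 4096"

// Hypothetical client-defined header stored at the start of a page.
struct FileHeader {
    int recordSize;
    int numRecords;
};

static_assert(static_cast<int>(sizeof(FileHeader)) <= PF_PAGE_SIZE,
              "header must fit within one page");

// Copy a header into the page contents that GetData would point at.
inline void WriteHeader(char *pData, const FileHeader &hdr) {
    std::memcpy(pData, &hdr, sizeof(hdr));
}

// Read the header back out of the page contents.
inline FileHeader ReadHeader(const char *pData) {
    FileHeader hdr;
    std::memcpy(&hdr, pData, sizeof(hdr));
    return hdr;
}
```

Using memcpy avoids relying on the alignment of the page buffer. With the real PF component, a client that writes to the page in this way would also mark the page as dirty and unpin it when done.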

RC GetPageNum (PageNum &pageNum) const

This method sets pageNum to the number of the page referred to by the PF_PageHandle object for which this method is called. The page handle object must refer to a page that is pinned in the buffer pool.

*** PF_PrintError ***

void PF_PrintError (RC rc)

This routine -- not part of a PF component C++ class -- is called to write a message associated with the nonzero PF return code rc onto the Unix stderr output stream. This routine has no return value.

Return Codes and Error Handling
Each method in the PF component except constructors and destructors returns either 0, indicating normal completion, or a nonzero return code, indicating an exception condition or error. All nonzero return codes are defined as integer constants in file pf.h.

The following are the positive return codes used by the PF component. These return codes indicate exception conditions, or errors from which the system should be able to recover or exit gracefully.

      PF_EOF              // end of file
      PF_PAGEPINNED       // page pinned in buffer
      PF_PAGENOTINBUF     // page to be unpinned is not in buffer
      PF_PAGEUNPINNED     // page already unpinned
      PF_PAGEFREE         // page already free
      PF_INVALIDPAGE      // invalid page number
      PF_FILEOPEN         // file handle already open
      PF_CLOSEDFILE       // file is closed

The following are the negative return codes used by the PF component. These return codes indicate errors from which the system probably cannot recover. The second group of these return codes indicates internal errors within the PF component. If you come across one of these, please report it to the TA.

      PF_NOMEM            // out of memory
      PF_NOBUF            // out of buffer space
      PF_INCOMPLETEREAD   // incomplete read of page from file
      PF_INCOMPLETEWRITE  // incomplete write of page to file
      PF_HDRREAD          // incomplete read of header from file
      PF_HDRWRITE         // incomplete write of header to file
 // Internal PF errors:
      PF_PAGEINBUF        // newly allocated page already in buffer
      PF_HASHNOTFOUND     // hash table entry not found
      PF_HASHPAGEEXIST    // page already exists in hash table
      PF_INVALIDNAME      // invalid file name
      PF_UNIX             // Unix error

When a client method calls a PF method and gets a nonzero return code, the client may need to return a nonzero code as well. Before returning, the client may first want to print a message associated with the PF return code; this is done by calling PF_PrintError.
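The print-and-propagate pattern can be sketched as follows. Everything here is a stand-in so the example compiles on its own: PrintErrorStub takes the place of PF_PrintError, FakePfCall simulates a PF method that returns a given code, and ClientOperation is a hypothetical client method.

```cpp
#include <cassert>
#include <iostream>

typedef int RC;

// Stand-in for PF_PrintError so the sketch compiles on its own (assumption).
inline void PrintErrorStub(RC rc) { std::cerr << "PF return code " << rc << "\n"; }

// Simulates a PF method that completes with the given return code.
inline RC FakePfCall(RC result) { return result; }

// Client method: on a nonzero PF code, print a message and pass the code upward.
inline RC ClientOperation(RC simulated) {
    RC rc = FakePfCall(simulated);
    if (rc != 0) {
        PrintErrorStub(rc);
        return rc;  // the caller sees the same code the PF call produced
    }
    return 0;       // normal completion
}
```

Passing the original code upward unchanged keeps the sign convention intact, so callers further up can still distinguish recoverable conditions from unrecoverable ones.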

Tracking Buffer Behavior
The PF component of the RedBase system can track statistics about the behavior of the buffer manager. Statistics are collected and reported only if the flag -DPF_STATS is included during compilation. You will find these statistics of particular benefit if you are planning a good showing in the RedBase Efficiency Contest! The number you want to minimize is the total number of read-page and write-page requests that the buffer manager issues. In addition to buffer I/O statistics, we track the total number of page requests sent to the buffer manager, along with the number of pages found in the buffer versus those that needed to be fetched from disk.

If you want to see the statistics while developing Parts 1 and 2 of the project, you will need to call PF_Statistics(), defined in pf_statistics.cc. It sends to stdout all statistics currently being tracked by the system. Be sure to include the following line at the top of the file that calls PF_Statistics():

extern void PF_Statistics();

One of the test programs for the PF component, pf_test2.cc, includes extensive examples of how statistics are tracked and printed. For Parts 3 and 4 of the project we will provide simple commands to reset and display statistics as part of the RedBase command-line interface.

The buffer statistics are tracked via a StatisticsMgr class that is encapsulated within the PF component. The StatisticsMgr class actually provides a very general and easy-to-use tool for tracking any statistics about the behavior of your RedBase system. If you want to use this tool to track other statistics, please see the document describing the StatisticsMgr class.