A Fast File System for UNIX

McKusick, Joy, Leffler and Fabry

Old File System:

Simple and elegant, but slow, can only achieved 2% of disk bandwidth.
Blocks are too small, make large file transfers inefficient.
Consecutive data blocks of files not close together on disk.
Free list initially created for optimal access, but as time goes by, it becomes completely random, causing files to have their data blocks allocated randomly on disk, which forces a seek on every block access.
i-nodes far from data. 4MB of inodes followed by 146MB of data, this requires a long seek to get to the data after accessing the inode.
i-nodes of directory not close together, which leads to poor performance on operations such as "ls".

Larger block size (4096 or 8192 bytes), so that larger blocks can be transfer in a single disk transaction, which can greatly increase file system throughput. Blocks are not made larger because most files on UNIX FS are small (less than 4K).
Uses fragments to avoid wasting space in large blocks. E.g. 4K/1K FS waste same % of space as 1K FS but can transfer 4K blocks.
Divides disk into groups of consecutive cylinders = cylinder groups. Each group contains super-block, i-nodes, bitmap of free blocks and usage summary info. Try to put related data into a cylinder group, and spread out unrelated data as far as possible.
The headers of cylinder groups spiral down the platters to prevent loss on crash.

On expansion to a file, FS tries to allocate space that can localize blocks in the file.
Tries to allocate new blocks that are rotationally optimal, and on the same cylinder as the previous block.
If the number of free blocks goes to zero, the performance is degraded dramatically, because of the inability to carry out the optimizations outlined above.

Conflicting goals: Cluster related data so that a large transfer is possible, but spread out unrelated data so the disk won't fill up in any one area. The problem with allocating all the data blocks in the same cylinder group is that large files will quickly use up available space in the group, forcing a spill over (current and further allocations) to other areas.
Switch to another cylinder groups at certain file size boundaries.
Keep directory within a cylinder group. When allocating for a new directory, find the cylinder group with the most free indoes.
Spread out different directories.
Global policy allocates files and directories to cylinder groups. Picks "optimal" next block for block allocation according to the goals outlined above.
Global policy calls local allocation routines to request for specific blocks. If cannot be fulfilled, then can choose from a sequence of alternatives: rotationally closest, on same cylinder group, quadratic hash for another group, and finally, exhaustive search.

20-40% of disk bandwidth for large reads/writes.
10-20x original UNIX speeds.
Writes are always slower than reads, because of the allocation costs in writes. In the old FS, writes are about 50% faster than reads, because of the disk cache buffer in which write requests are reordered to allow minimal seek distances. In the new FS, although writes are already presented to the disk in optimal order, it takes time to find an "optimal" position for a new block.
Ignore technology trends? Reads less likely than writes, files grow fast (leading to many writes).

Long file names.
Advisory file locks: only applied when it is requested by a program. This allows sys admin to participate in the locking mechanisms, because sys admin can override any protection scheme.
symbolic links across file systems, intermachines.
atomic rename capability.
disk quotas: soft limits issues warning until hard limit is reached.

Parametrize FS implementation for the hardware it is running on. (E.g. rotational latency, disk bandwidth, etc.).
Measurement-driven design decisions.
Locality wins.