The Zebra Striped Network File System
John H. Hartman and John K. Ousterhout
One-line summary: Distributed file system, performing RAID-like
striping in software on a per-client basis using LFS techniques.
Overview/Main Points
- Building blocks:
- RAID - Disk array; transfers are divided into striping units, each unit going to a
different disk in the array. A set of consecutive striping units is a stripe, which
includes a parity unit. Small writes are 4 times as expensive as in a disk array
without parity (read old data, read old parity, write new data, write new parity -
see the sketch after this list). Also, memory and I/O bandwidth are potential
performance bottlenecks.
- NFS - could do per-file striping, i.e., each file striped across its own set of
stripes. Small files are a problem: either a small file is striped across all servers,
in which case network and disk overheads dominate, or it is placed on a single server,
in which case its parity consumes as much space as the file itself. Data and parity
writes must also be done as a single atomic operation.
- LFS - Zebra is LFS plus per-client striping.
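To make the small-write penalty concrete, here is a minimal sketch in Go (not from the
paper; the Disk interface and its ReadUnit/WriteUnit methods are hypothetical): the
parity unit is recomputed as old parity XOR old data XOR new data, so updating a single
striping unit costs four I/Os.

```go
package raid

// Disk stands in for one disk in the array; ReadUnit and WriteUnit are
// hypothetical stand-ins for real striping-unit I/O.
type Disk interface {
	ReadUnit(unit int) []byte
	WriteUnit(unit int, data []byte)
}

// xorUnits XORs two striping units of equal length.
func xorUnits(a, b []byte) []byte {
	out := make([]byte, len(a))
	for i := range a {
		out[i] = a[i] ^ b[i]
	}
	return out
}

// SmallWrite updates one striping unit and its stripe's parity unit:
// read old data, read old parity, write new data, write new parity --
// four I/Os where an array without parity needs only one.
func SmallWrite(dataDisk, parityDisk Disk, unit int, newData []byte) {
	oldData := dataDisk.ReadUnit(unit)     // 1. read old data
	oldParity := parityDisk.ReadUnit(unit) // 2. read old parity
	newParity := xorUnits(xorUnits(oldParity, oldData), newData)
	dataDisk.WriteUnit(unit, newData)     // 3. write new data
	parityDisk.WriteUnit(unit, newParity) // 4. write new parity
}
```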
- Zebra components:
- clients: contact the file manager to get block pointers, then contact the
storage servers to get the data. On writes, buffer data until a full fragment
accumulates (512 KB!!??!!).
- storage servers: operate on stripe fragments, indexed by the opaque identifier
(client identifier, stripe sequence number, fragment offset within stripe); see the
sketch after this list. Can store (synchronously), append to (atomically, so crashes
are ok), retrieve, delete, or find the most recent fragment.
- file manager: stores all file metadata (protection info, block pointers
to file data, directories, symlinks, special files for I/O devices, ...).
Does name lookup and cache consistency. Only stores block pointers, never
file data. Implemented as Sprite LFS, so get all Sprite LFS consistency,
fault tolerance, etc. for free.
- stripe cleaner: like LFS segment cleaner. User-level process. Uses LFS
cleaning policy.
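A rough sketch (in Go, with all names assumed) of the storage-server view described
above: fragments are addressed by an opaque (client, stripe, offset) triple, and the
server exposes only the handful of fragment operations the notes list; the real Zebra
RPC interface differs in detail.

```go
package zebra

// FragmentID is the opaque identifier storage servers use; they never
// interpret file names or file offsets.
type FragmentID struct {
	ClientID  uint32 // whose log the fragment belongs to
	StripeSeq uint64 // stripe sequence number within that client's log
	Offset    uint32 // fragment offset within the stripe
}

// StorageServer is a hypothetical interface covering the operations listed
// in the notes above.
type StorageServer interface {
	Store(id FragmentID, data []byte) error  // synchronous write of a whole fragment
	Append(id FragmentID, data []byte) error // atomic append, safe across crashes
	Retrieve(id FragmentID, off, length int) ([]byte, error)
	Delete(id FragmentID) error
	MostRecent(clientID uint32) (FragmentID, error) // newest fragment in a client's log
}
```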
- Operational details:
- Deltas - changes to blocks in a file. Contain file ID, file version number (for
delta ordering across logs in crash recovery), block number, old block pointer, and
new block pointer (the old and new pointers detect race conditions between clients
and the cleaner). Deltas are stored in the clients' logs; see the sketch after this
list.
- Writing files - batch block writes plus their update deltas into the current
fragment, incrementing the file version number. Transfer fragments to all servers
concurrently via asynchronous RPCs; the client computes parity and handles stripe
deltas. Parity writing is delayed for small files - a disk crash is ok as long as
the client survives, since the client still holds the parity.
- Reading files - the file manager must know about all opens/closes for cache
consistency. The client fetches block pointers, then fetches data - two RPC
round trips. Prefetching of data for large files, and of whole files for small
files (locality in file access is assumed).
- Stripe cleaning - utilization is computed by the stripe cleaner by processing
deltas from client logs, appending all deltas that refer to a given stripe to a
"stripe status file", which is used to identify live blocks without having to search
through all logs. Cleaning is similar to a read+write of a block, but doesn't do a
file open, doesn't do cache consistency, doesn't do user/kernel data copying, doesn't
update modify times or version numbers, and generates a "cleaner delta" instead of an
update delta.
- File access/cleaning conflicts: optimistic approach. The cleaner does its work and
issues a cleaner delta. The file manager looks at cleaner deltas and update deltas,
detects a conflict by comparing the old block pointer in a delta with its metadata,
and favours client update deltas over cleaner deltas, issuing a "reject delta" so the
cleaner knows about the conflict (sketched below). If a block is cleaned after a
client fetches metadata but before it reads the block, the client gets an error and
retries the metadata fetch.
- Adding storage servers on the fly - need to keep track of how many
storage servers make up a stripe group for some files. Cleaning moves files
to larger stripe group over time.
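For concreteness, a sketch (in Go) of the delta record and the client-side parity
computation described in the bullets above; field and function names are mine, and
FragmentID is the type from the storage-server sketch earlier.

```go
package zebra

// Delta is the per-block log record: the old/new pointer pair is what lets
// the file manager detect races between client updates and the cleaner.
type Delta struct {
	FileID      uint64
	FileVersion uint64     // orders deltas across client logs during recovery
	BlockNum    uint32
	OldPtr      FragmentID // previous location of the block (zero value if new)
	NewPtr      FragmentID // location of the block in this client's log
}

const fragmentSize = 512 << 10 // 512 KB log fragments, as in the paper

// StripeParity XORs a stripe's data fragments into the parity fragment the
// client writes itself: with N storage servers, N-1 fragments carry data
// and one carries parity.
func StripeParity(dataFragments [][]byte) []byte {
	parity := make([]byte, fragmentSize)
	for _, frag := range dataFragments {
		for i := range frag {
			parity[i] ^= frag[i]
		}
	}
	return parity
}
```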
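And a sketch of the optimistic conflict check on the file manager side, reusing the
Delta and FragmentID types above; FileManager and its methods are my own assumptions,
not the paper's interface. A delta is applied only after comparing its old pointer
with the current metadata, client updates always win, and a losing cleaner delta is
answered with a reject delta.

```go
package zebra

// DeltaKind distinguishes client update deltas from cleaner deltas.
type DeltaKind int

const (
	UpdateDelta DeltaKind = iota
	CleanerDelta
)

// FileManager keeps the authoritative block pointers (sketched here as an
// in-memory map keyed by file ID and block number).
type FileManager struct {
	ptrs map[[2]uint64]FragmentID
}

// NewFileManager returns an empty file manager for the sketch.
func NewFileManager() *FileManager {
	return &FileManager{ptrs: make(map[[2]uint64]FragmentID)}
}

func (fm *FileManager) currentPtr(fileID uint64, blk uint32) FragmentID {
	return fm.ptrs[[2]uint64{fileID, uint64(blk)}]
}

func (fm *FileManager) setPtr(fileID uint64, blk uint32, p FragmentID) {
	fm.ptrs[[2]uint64{fileID, uint64(blk)}] = p
}

func (fm *FileManager) emitRejectDelta(d Delta) { /* append to its log; omitted */ }

// ApplyDelta detects conflicts by comparing the delta's old pointer with the
// current metadata; client update deltas always win, and a cleaner delta
// that lost the race is answered with a reject delta.
func (fm *FileManager) ApplyDelta(kind DeltaKind, d Delta) {
	cur := fm.currentPtr(d.FileID, d.BlockNum)
	switch {
	case kind == UpdateDelta:
		// An old-pointer mismatch here just means the cleaner moved the
		// block in the meantime; the client's new data still wins.
		// (Version numbers would additionally be checked during crash
		// replay to keep reprocessing idempotent.)
		fm.setPtr(d.FileID, d.BlockNum, d.NewPtr)
	case cur == d.OldPtr:
		// Cleaner delta with no intervening update: accept the new location.
		fm.setPtr(d.FileID, d.BlockNum, d.NewPtr)
	default:
		// The block was rewritten after the cleaner copied it: reject.
		fm.emitRejectDelta(d)
	}
}
```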
- Consistency after crashes: three issues not in LFS/Sprite.
- internal stripe consistency - fragments that were in the process of being written
may be missing or partial. A partial fragment is detected with a checksum. If a
storage server missed fragments while it was down, parity plus the neighbouring
fragments are used to recover them (see the sketch after this list).
- stripes vs. metadata - file manager keeps track of current position in
each client's log, and periodically checkpoints metadata. After file manager
crash, reprocess all deltas after checkpoints. Version numbers give
idempotency, and ordering of deltas across all clients' logs.
- stripes vs. cleaner - if stripe cleaner crashes, needs to recover state.
Checkpoint stripe cleaner state.
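A minimal sketch (Go, names assumed) of the two recovery mechanisms just described: a
checksum detects a partially written fragment, and a missing fragment is rebuilt by
XORing the stripe's surviving fragments, since data and parity are interchangeable
under XOR.

```go
package zebra

import "hash/crc32"

// FragmentComplete detects a partially written fragment: the checksum stored
// with the fragment only matches once the whole write made it to disk.
func FragmentComplete(data []byte, storedChecksum uint32) bool {
	return crc32.ChecksumIEEE(data) == storedChecksum
}

// ReconstructFragment rebuilds the fragment lost on a crashed storage server
// from the stripe's surviving fragments; all fragments in a stripe are
// assumed to be the same length.
func ReconstructFragment(surviving [][]byte) []byte {
	rebuilt := make([]byte, len(surviving[0]))
	for _, frag := range surviving {
		for i := range frag {
			rebuilt[i] ^= frag[i]
		}
	}
	return rebuilt
}
```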
- Performance:
- 4-5x improvement for large reads/writes because of parallelism.
- negligible improvement for small reads/writes because clients contact
central point on each open/close. (Name caching would fix. Zebra doesn't
give concurrent write sharing anyway.)
- large file write - with a single server, the disk is the bottleneck. At 4 servers,
FDDI saturation stops linear scalability. Parity computation by the client is
expensive - with N storage servers it is like doing N/(N-1) writes (e.g., with 4
servers the client writes 4/3, about 1.33x, as much data as it stores).
- large file read - 2 servers saturate single client (data copies between
app, cache, and network).
- the CPUs at the file manager and the client are the bottleneck for small writes
because of synchronous RPCs to open/close files.
Relevance
Ideas in Zebra were generalized to result in xFS. Distributed,
reliable, parallel file system - one day could become the InternetFS?
Flaws
- the file manager as a central point is a clear weakness.
- why must the file manager track all opens/closes if Zebra doesn't promise
concurrent write sharing consistency?
- clients do much of the work that could be pushed into the infrastructure.