The Zebra Striped Network File System
John H. Hartman and John K. Ousterhout
One-line summary: Distributed file system, performing RAID-like
striping in software on a per-client basis using LFS techniques.
Overview/Main Points
- Building blocks:
- RAID - Disk array; transfers are divided into striping units, each unit going to a
different disk in the array. A set of consecutive striping units is a stripe, which
includes a parity unit. Small writes are 4 times as expensive as in a disk array
without parity (read old data, read old parity, write new data, write new parity -
see the sketch after this list). Also, memory and I/O bandwidth are potential
performance bottlenecks.
- NFS - could do per-file striping, i.e., each file striped across its own set of
stripes. Small files are a problem: either a small file is striped across all servers,
in which case network and disk overheads dominate, or it is placed on a single server,
in which case its parity consumes as much space as the file itself. Data and parity
writes must also be done as a single atomic operation.
- LFS - Zebra is LFS plus per-client striping.
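To make the small-write penalty concrete, here is a minimal sketch in Go (not from the
paper; the Disk interface and its ReadUnit/WriteUnit methods are hypothetical): the
parity unit is recomputed as old parity XOR old data XOR new data, so updating a single
striping unit costs four I/Os.

```go
package raid

// Disk stands in for one disk in the array; ReadUnit and WriteUnit are
// hypothetical stand-ins for real striping-unit I/O.
type Disk interface {
	ReadUnit(unit int) []byte
	WriteUnit(unit int, data []byte)
}

// xorUnits XORs two striping units of equal length.
func xorUnits(a, b []byte) []byte {
	out := make([]byte, len(a))
	for i := range a {
		out[i] = a[i] ^ b[i]
	}
	return out
}

// SmallWrite updates one striping unit and its stripe's parity unit:
// read old data, read old parity, write new data, write new parity --
// four I/Os where an array without parity needs only one.
func SmallWrite(dataDisk, parityDisk Disk, unit int, newData []byte) {
	oldData := dataDisk.ReadUnit(unit)     // 1. read old data
	oldParity := parityDisk.ReadUnit(unit) // 2. read old parity
	newParity := xorUnits(xorUnits(oldParity, oldData), newData)
	dataDisk.WriteUnit(unit, newData)     // 3. write new data
	parityDisk.WriteUnit(unit, newParity) // 4. write new parity
}
```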
- Zebra components:
- clients: contact the file manager to get block pointers, then contact the
storage servers to get the data. On writes, buffer data until a full fragment
accumulates (512 KB!!??!!).
- storage servers: operate on stripe fragments, indexed by the opaque identifier
(client identifier, stripe sequence number, fragment offset within stripe); see the
sketch after this list. Can store (synchronously), append to (atomically, so crashes
are ok), retrieve, delete, or find the most recent fragment.
- file manager: stores all file metadata (protection info, block pointers
to file data, directories, symlinks, special files for I/O devices, ...).
Does name lookup and cache consistency. Only stores block pointers, never
file data. Implemented as Sprite LFS, so get all Sprite LFS consistency,
fault tolerance, etc. for free.
- stripe cleaner: like LFS segment cleaner. User-level process. Uses LFS
cleaning policy.
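A rough sketch (in Go, with all names assumed) of the storage-server view described
above: fragments are addressed by an opaque (client, stripe, offset) triple, and the
server exposes only the handful of fragment operations the notes list; the real Zebra
RPC interface differs in detail.

```go
package zebra

// FragmentID is the opaque identifier storage servers use; they never
// interpret file names or file offsets.
type FragmentID struct {
	ClientID  uint32 // whose log the fragment belongs to
	StripeSeq uint64 // stripe sequence number within that client's log
	Offset    uint32 // fragment offset within the stripe
}

// StorageServer is a hypothetical interface covering the operations listed
// in the notes above.
type StorageServer interface {
	Store(id FragmentID, data []byte) error  // synchronous write of a whole fragment
	Append(id FragmentID, data []byte) error // atomic append, safe across crashes
	Retrieve(id FragmentID, off, length int) ([]byte, error)
	Delete(id FragmentID) error
	MostRecent(clientID uint32) (FragmentID, error) // newest fragment in a client's log
}
```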
- Operational details:
- Deltas - changes to blocks in a file. Contain file ID, file version number (for
delta ordering across logs in crash recovery), block number, old block pointer, and
new block pointer (the old and new pointers detect race conditions between clients
and the cleaner). Deltas are stored in the clients' logs; see the sketch after this
list.
- Writing files - batch block writes plus their update deltas into the current
fragment, incrementing the file version number. Transfer fragments to all servers
concurrently via asynchronous RPCs; the client computes parity and handles stripe
deltas. Parity writing is delayed for small files - a disk crash is ok as long as
the client survives, since the client still holds the parity.
- Reading files - the file manager must know about all opens/closes for cache
consistency. The client fetches block pointers, then fetches data - two RPC
round trips. Prefetching of data for large files, and of whole files for small
files (locality in file access is assumed).
- Stripe cleaning - utilization is computed by the stripe cleaner by processing
deltas from client logs, appending all deltas that refer to a given stripe to a
"stripe status file", which is used to identify live blocks without having to search
through all logs. Cleaning is similar to a read+write of a block, but doesn't do a
file open, doesn't do cache consistency, doesn't do user/kernel data copying, doesn't
update modify times or version numbers, and generates a "cleaner delta" instead of an
update delta.
- File access/cleaning conflicts: optimistic approach. The cleaner does its work and
issues a cleaner delta. The file manager looks at cleaner deltas and update deltas,
detects a conflict by comparing the old block pointer in a delta with its metadata,
and favours client update deltas over cleaner deltas, issuing a "reject delta" so the
cleaner knows about the conflict (sketched below). If a block is cleaned after a
client fetches metadata but before it reads the block, the client gets an error and
retries the metadata fetch.
- Adding storage servers on the fly - need to keep track of how many
storage servers make up a stripe group for some files. Cleaning moves files
to larger stripe group over time.
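For concreteness, a sketch (in Go) of the delta record and the client-side parity
computation described in the bullets above; field and function names are mine, and
FragmentID is the type from the storage-server sketch earlier.

```go
package zebra

// Delta is the per-block log record: the old/new pointer pair is what lets
// the file manager detect races between client updates and the cleaner.
type Delta struct {
	FileID      uint64
	FileVersion uint64     // orders deltas across client logs during recovery
	BlockNum    uint32
	OldPtr      FragmentID // previous location of the block (zero value if new)
	NewPtr      FragmentID // location of the block in this client's log
}

const fragmentSize = 512 << 10 // 512 KB log fragments, as in the paper

// StripeParity XORs a stripe's data fragments into the parity fragment the
// client writes itself: with N storage servers, N-1 fragments carry data
// and one carries parity.
func StripeParity(dataFragments [][]byte) []byte {
	parity := make([]byte, fragmentSize)
	for _, frag := range dataFragments {
		for i := range frag {
			parity[i] ^= frag[i]
		}
	}
	return parity
}
```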
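And a sketch of the optimistic conflict check on the file manager side, reusing the
Delta and FragmentID types above; FileManager and its methods are my own assumptions,
not the paper's interface. A delta is applied only after comparing its old pointer
with the current metadata, client updates always win, and a losing cleaner delta is
answered with a reject delta.

```go
package zebra

// DeltaKind distinguishes client update deltas from cleaner deltas.
type DeltaKind int

const (
	UpdateDelta DeltaKind = iota
	CleanerDelta
)

// FileManager keeps the authoritative block pointers (sketched here as an
// in-memory map keyed by file ID and block number).
type FileManager struct {
	ptrs map[[2]uint64]FragmentID
}

// NewFileManager returns an empty file manager for the sketch.
func NewFileManager() *FileManager {
	return &FileManager{ptrs: make(map[[2]uint64]FragmentID)}
}

func (fm *FileManager) currentPtr(fileID uint64, blk uint32) FragmentID {
	return fm.ptrs[[2]uint64{fileID, uint64(blk)}]
}

func (fm *FileManager) setPtr(fileID uint64, blk uint32, p FragmentID) {
	fm.ptrs[[2]uint64{fileID, uint64(blk)}] = p
}

func (fm *FileManager) emitRejectDelta(d Delta) { /* append to its log; omitted */ }

// ApplyDelta detects conflicts by comparing the delta's old pointer with the
// current metadata; client update deltas always win, and a cleaner delta
// that lost the race is answered with a reject delta.
func (fm *FileManager) ApplyDelta(kind DeltaKind, d Delta) {
	cur := fm.currentPtr(d.FileID, d.BlockNum)
	switch {
	case kind == UpdateDelta:
		// An old-pointer mismatch here just means the cleaner moved the
		// block in the meantime; the client's new data still wins.
		// (Version numbers would additionally be checked during crash
		// replay to keep reprocessing idempotent.)
		fm.setPtr(d.FileID, d.BlockNum, d.NewPtr)
	case cur == d.OldPtr:
		// Cleaner delta with no intervening update: accept the new location.
		fm.setPtr(d.FileID, d.BlockNum, d.NewPtr)
	default:
		// The block was rewritten after the cleaner copied it: reject.
		fm.emitRejectDelta(d)
	}
}
```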
- Consistency after crashes: three issues not in LFS/Sprite.
- internal stripe consistency - fragments that were in the process of being written
may be missing or partial. A partial fragment is detected with a checksum. If a
storage server missed fragments while it was down, parity plus the neighbouring
fragments are used to recover them (see the sketch after this list).
- stripes vs. metadata - file manager keeps track of current position in
each client's log, and periodically checkpoints metadata. After file manager
crash, reprocess all deltas after checkpoints. Version numbers give
idempotency, and ordering of deltas across all clients' logs.
- stripes vs. cleaner - if stripe cleaner crashes, needs to recover state.
Checkpoint stripe cleaner state.
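A minimal sketch (Go, names assumed) of the two recovery mechanisms just described: a
checksum detects a partially written fragment, and a missing fragment is rebuilt by
XORing the stripe's surviving fragments, since data and parity are interchangeable
under XOR.

```go
package zebra

import "hash/crc32"

// FragmentComplete detects a partially written fragment: the checksum stored
// with the fragment only matches once the whole write made it to disk.
func FragmentComplete(data []byte, storedChecksum uint32) bool {
	return crc32.ChecksumIEEE(data) == storedChecksum
}

// ReconstructFragment rebuilds the fragment lost on a crashed storage server
// from the stripe's surviving fragments; all fragments in a stripe are
// assumed to be the same length.
func ReconstructFragment(surviving [][]byte) []byte {
	rebuilt := make([]byte, len(surviving[0]))
	for _, frag := range surviving {
		for i := range frag {
			rebuilt[i] ^= frag[i]
		}
	}
	return rebuilt
}
```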
- Performance:
- 4-5x improvement for large reads/writes because of parallelism.
- negligible improvement for small reads/writes because clients contact
central point on each open/close. (Name caching would fix. Zebra doesn't
give concurrent write sharing anyway.)
- large file write - with a single server, the disk is the bottleneck. At 4 servers,
FDDI saturation stops linear scalability. Parity computation by the client is
expensive - with N storage servers it is like doing N/(N-1) writes (e.g., with 4
servers the client writes 4/3, about 1.33x, as much data as it stores).
- large file read - 2 servers saturate single client (data copies between
app, cache, and network).
- the CPUs at the file manager and the client are the bottleneck for small writes
because of synchronous RPCs to open/close files.
Relevance
Ideas in Zebra were generalized to result in xFS. Distributed,
reliable, parallel file system - one day could become the InternetFS?
Flaws
- the file manager as a central point is a clear weakness.
- why must the file manager track all opens/closes if Zebra doesn't promise
concurrent write sharing consistency?
- clients do much of the work that could be pushed into the infrastructure.