One-line summary: Repeated application of Amdahl's law (optimize the common case) leads to a fast system.
Context switches - don't save floating-point registers unless the thread actually uses them. (An illegal-instruction trap detects the first use, and the context switch code is restarted in that case.) No thread dispatcher - threads are chained together directly in a circular queue, using procedure chaining on each thread's synthesized context switch procedure. Each thread has its own vector table for interrupt handlers.
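A minimal sketch of the two ideas above, written as a Python simulation rather than kernel code. All names here (`Thread`, `fp_trap`, `make_ring`, `run`) are hypothetical, and the per-thread `switch_out` method only stands in for the synthesized context switch procedure; the real system generates machine code per thread.

```python
class Thread:
    def __init__(self, name):
        self.name = name
        self.uses_fp = False   # flipped by the (simulated) illegal-instruction trap
        self.next = None       # successor in the circular ready queue
        self.saved = []        # record of which register sets each switch saved

    def switch_out(self):
        """Stand-in for this thread's own context switch procedure."""
        regs = ["int"]
        if self.uses_fp:
            regs.append("fp")  # FP state is saved only for threads that used it
        self.saved.append(regs)
        return self.next       # hand control straight to the next thread


def fp_trap(thread):
    """Simulated trap on the first floating-point instruction of a thread."""
    thread.uses_fp = True


def make_ring(threads):
    """Link the threads into a circle; there is no separate dispatcher."""
    for t, nxt in zip(threads, threads[1:] + threads[:1]):
        t.next = nxt
    return threads[0]


def run(start, switches):
    """Follow the chain for a number of switches; return the run order."""
    order, current = [], start
    for _ in range(switches):
        order.append(current.name)
        current = current.switch_out()
    return order
```

The point of the sketch is that the "who runs next" decision is just a pointer stored in the current thread, and that the cost of saving floating-point state is paid only by threads that have trapped once.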
Signals and interrupts - alter the thread's TTE (thread table entry) so that the thread jumps to the signal handler when it is reactivated. Error traps - must be synchronous, so the error trap handler copies a kernel stack frame onto the user stack, modifies the return address in the kernel stack frame to point to the user's error signal procedure, and executes a return from exception.
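The signal-delivery trick can be illustrated with a small Python model. This is only a sketch under assumptions: the `TTE` class, its `resume_pc` and `interrupted_pc` fields, and the functions `post_signal` and `reactivate` are hypothetical names, and program counters are plain integers.

```python
class TTE:
    """Toy thread table entry holding only the address to resume at."""

    def __init__(self, resume_pc):
        self.resume_pc = resume_pc
        self.interrupted_pc = None  # where the thread was before the signal


def post_signal(tte, handler_pc):
    """Deliver a signal by rewriting the TTE; the thread itself is not run."""
    tte.interrupted_pc = tte.resume_pc
    tte.resume_pc = handler_pc


def reactivate(tte):
    """Return the address the thread starts executing at when next scheduled."""
    pc = tte.resume_pc
    if tte.interrupted_pc is not None:
        # After the handler runs, the thread continues where it left off.
        tte.resume_pc = tte.interrupted_pc
        tte.interrupted_pc = None
    return pc
```

For example, after `post_signal(tte, 0x900)` on a thread parked at `0x100`, the first `reactivate(tte)` yields the handler address and the next one yields the original address, so no extra signal-dispatch step is needed on the scheduling path.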
Scheduling is done by changing the CPU quantum assigned to a thread and by reordering the ready queue (the circular TTE queue).
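A small Python sketch of these two scheduling knobs, assuming the ready queue can be modeled as a `collections.deque` of entries. The names `Entry`, `set_quantum`, `move_to_front`, and `one_round` are hypothetical, and quanta are abstract tick counts rather than real time units.

```python
from collections import deque


class Entry:
    """Toy ready-queue entry: a thread name and its CPU quantum in ticks."""

    def __init__(self, name, quantum):
        self.name = name
        self.quantum = quantum


def set_quantum(ready, name, quantum):
    """Knob 1: change how long a thread runs each time it gets the CPU."""
    for entry in ready:
        if entry.name == name:
            entry.quantum = quantum


def move_to_front(ready, name):
    """Knob 2: reorder the circular queue so the named thread runs next."""
    entry = next(e for e in ready if e.name == name)
    ready.remove(entry)
    ready.appendleft(entry)


def one_round(ready):
    """Return (name, ticks) pairs for one full pass around the queue."""
    schedule = []
    for _ in range(len(ready)):
        entry = ready[0]
        schedule.append((entry.name, entry.quantum))
        ready.rotate(-1)  # advance to the next entry in the circle
    return schedule
```

With three threads at quantum 2, calling `set_quantum(ready, "c", 5)` and `move_to_front(ready, "c")` gives `c` both a larger share of each round and the earliest slot in it, without any separate priority structure.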