Random Notes During Thomas Wenisch Talk
Random notes during Thomas Wenisch talk at UIUC
GPU world underutilization problem
- Unpopular model sitting there in
- Lack of security for multi-tenancy
- Lack of fast core-provisioning?
- Lack of performance isolation in per-core allocation and
- Slow reaction of scaling compared to CPU
- Multi-tenancy is not well understood
- Model loading is much slower. Either
- Network devices are much cheaper so accelerator. So just throw money for network contention problem
- Interference in network is not a big issue in training since mostly
Tiered model checkpoint?
- SSD
- Flash
- Memory
Training
- Everybody wants to talk to everybody together
- One port not all ports maybe
- Everybody yells at once
- Bursty traffic
In-switch computing
- Encryption is the challenge
- Key distribution
- Switch needs to participate in encryption (key distribution),
We don’t want fabric to be bottleneck.
Planned OCS angles for training
different topology needs different hardware,
- Cable length
- Shallow buffer vs deeper buffer in switch