Zhen Zhang
Applied Scientist, Amazon Web Services
Verified email at amazon.com
Title
Cited by
Year
PipeSwitch: Fast pipelined context switching for deep learning applications
Z Bai, Z Zhang, Y Zhu, X Jin
14th USENIX Symposium on Operating Systems Design and Implementation (OSDI …, 2020
Cited by: 142; Year: 2020
Is network the bottleneck of distributed training?
Z Zhang, C Chang, H Lin, Y Wang, R Arora, X Jin
Proceedings of the Workshop on Network Meets AI & ML, 8-13, 2020
Cited by: 88; Year: 2020
Gemini: Fast failure recovery in distributed training with in-memory checkpoints
Z Wang, Z Jia, S Zheng, Z Zhang, X Fu, TSE Ng, Y Wang
Proceedings of the 29th Symposium on Operating Systems Principles, 364-381, 2023
Cited by: 62; Year: 2023
MiCS: near-linear scaling for training gigantic model on public cloud
Z Zhang, S Zheng, Y Wang, J Chiu, G Karypis, T Chilimbi, M Li, X Jin
arXiv preprint arXiv:2205.00119, 2022
Cited by: 43; Year: 2022
Oobleck: Resilient distributed training of large models using pipeline templates
I Jang, Z Yang, Z Zhang, X Jin, M Chowdhury
Proceedings of the 29th Symposium on Operating Systems Principles, 382-395, 2023
Cited by: 37; Year: 2023
DISTMM: Accelerating Distributed Multimodal Model Training
J Huang, Z Zhang, S Zheng, F Qin, Y Wang
21st USENIX Symposium on Networked Systems Design and Implementation (NSDI …, 2024
Cited by: 11; Year: 2024
TKPERM: cross-platform permission knowledge transfer to detect overprivileged third-party applications
FH Shezan, K Cheng, Z Zhang, Y Cao, Y Tian
Network and Distributed Systems Security (NDSS) Symposium, 2020
Cited by: 11; Year: 2020
Distmind: Efficient resource disaggregation for deep learning workloads
X Jin, Z Bai, Z Zhang, Y Zhu, Y Zhong, X Liu
IEEE/ACM Transactions on Networking 32 (3), 2422-2437, 2024
Cited by: 7; Year: 2024
Towards a secure zero-rating framework with three parties
Z Liu, Z Zhang, Y Cao, Z Xi, S Jing, H La Roche
27th USENIX Security Symposium (USENIX Security 18), 711-728, 2018
Cited by: 4; Year: 2018
PipeFill: Using GPUs During Bubbles in Pipeline-parallel LLM Training
D Arfeen, Z Zhang, X Fu, GR Ganger, Y Wang
arXiv preprint arXiv:2410.07192, 2024
Cited by: 1; Year: 2024
SDCC: Software-defined collective communication for distributed training
X Jin, Z Zhang, Y Jia, Y Ma, X Liu
Science China Information Sciences 67 (9), 192104, 2024
Cited by: 1; Year: 2024
Decoupled Model Schedule for Deep Learning Training
H Chen, CH Yu, S Zheng, Z Zhang, Z Zhang, Y Wang
arXiv preprint arXiv:2302.08005, 2023
Cited by: 1; Year: 2023
MiCS: Near-linear Scaling for Training Gigantic Model on Public Cloud
Z Zhang, S Zheng, Y Wang, J Chiu, G Karypis, T Chilimbi, M Li, X Jin
Proceedings of the VLDB Endowment, 0