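Miscellaneous: AI and AI Systems Learning Materials
- Parallelism and communication (distributed training parallelism): https://zhuanlan.zhihu.com/p/655939356
- Computer Networking course by Zheng Quan (郑烇, USTC) on Bilibili, which covers a calculation similar to the alpha-beta model: https://www.bilibili.com/video/BV1JV411t7ow?spm_id_from=333.788.videopod.episodes&vd_source=c007795d296b25971bfa989ca8c1b43e&p=8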
-
NVIDIA collective communication materials
- Gurobi learning materials
-
TE-CCL talk
GitHub: https://github.com/microsoft/TE-CCL
Conference talk: https://www.youtube.com/watch?v=ChjWIwM87LY
-
Review paper on collective communication
Wei et al_2024_Communication Optimization for Distributed Training.pdf
-
Distributed training and collective communication for deep learning (by 昇腾AI开发者, Ascend AI Developers)
Part 1
https://www.bilibili.com/opus/997347047443005495?spm_id_from=333.1387.0.0
Part 2
https://www.bilibili.com/opus/1002538906998013960
-
Cross-datacenter training
- Summary of parallelism strategies and their corresponding communication patterns
-
All-Reduce formula derivation (Section 2.2)
Jiang 等 - A Unified Architecture for Accelerating Distributed.pdf
-
Formula derivation for data parallelism + pipeline parallelism in cross-cloud communication
Strati 等 - 2024 - ML Training with Cloud GPU Shortages Is Cross-Reg.pdf
-
Summary of hybrid parallelism strategies
Kahira et al_2021_An Oracle for Guiding Large-Scale Model-Hybrid Parallel Training of.pdf
-
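Analysis of distributed training characteristics for large models (including a breakdown of the different parallelism strategies)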
Li et al_2024_Understanding Communication Characteristics of Distributed Training.pdf
-
- Paper collection on parallelism strategies for distributed training
- https://github.com/DicardoX/Research-Space
-
A series of white papers on computing power networks
-
Huawei HCCL manual
https://www.hiascend.com/document/detail/zh/canncommercial/80RC3/developmentguide/hccl/hcclug/hcclug_000009.html
-
Collective communication techniques
https://mp.weixin.qq.com/s/oTIYRJV-pqwgg8S7uBb25A