v0.15.2 Patch Release
What's Changed
- Update version.txt after 0.15.1 release by @loadams in #6493
- HPU: add required ENV vars to accelerator init by @nelyahu in #6495
- Quiet op_builder->is_compatible warning by @terry-for-github in #6093
- fix pipeline eval_batch micro_batches argument for schedule by @nelyahu in #6484
- Fix the broken url link by @rogerxfeng8 in #6500
- fix environment variable export bug for MultiNodeRunner by @TideDra in #5878
- Revert "BF16 optimizer: Clear lp grads after updating hp grads in hook" by @nelyahu in #6508
- wrap include cuda_bf16.h with ifdef BF16_AVAILABLE by @oelayan7 in #6520
- Avoid security issues of subprocess shell by @tjruwase in #6498
- Add conditional on torch version for scaled_dot_product_attention by @loadams in #6517
- Added Intel Gaudi to Accelerator Setup Guide by @ShifaAbu in #6543
- Skip failing newly added tests in accelerate by @loadams in #6574
- Use msgpack for p2p comm by @tohtana in #6547
- DeepNVMe perf tuning by @tjruwase in #6560
- [Accelerator] Cambricon MLU support by @Andy666G in #6472
- Fix gradient accumulation for Z2+offload by @tohtana in #6550
- fix errors when setting zero3 leaf modules with torch.compile by @NirSonnenschein in #6564
- [XPU] Support DeepNVMe new code structure by @Liangliang-Ma in #6532
- Add APIs to offload states of model, optimizer, and engine by @tohtana in #6011
- add bfloat16 to inference support dtypes by @nelyahu in #6528
- [COMPILE] workflow for deepspeed + torch.compile by @YizhouZ in #6570
- Fixes on the accelerate side mean we do not need to skip this test by @loadams in #6583
- Fix torch include in `op_builder/mlu/fused_adam.py` and update no-torch workflow triggers by @loadams in #6584
- [ROCm] Fix subprocess error by @jagadish-amd in #6587
- Cleanup CODEOWNERS file to be valid by @loadams in #6603
- Add SSF Best practices badge by @loadams in #6604
- Move V100 workflows from cuda 11.1/11.7 to 12.1 by @loadams in #6607
- Fix SD workflow by @loadams in #6609
- Pin accelerate to fix CI failures/issues by @loadams in #6610
- Add llama3.2 vision autotp by @Yejing-Lai in #6577
- Improve DS logging control by @tjruwase in #6602
- Fix device selection using CUDA_VISIBLE_DEVICES by @tohtana in #6530
- Handle when `backend` is also in compile_kwargs by @oraluben in #6502
- Rearrange inference OPS and stop using builder.load by @oelayan7 in #5490
- Unpin accelerate tests, update lightning with node16 removal by @loadams in #6611
- Enabled Qwen2-MoE Tensor Parallelism (TP) inference by @gyou2021 in #6551
New Contributors
- @TideDra made their first contribution in #5878
- @ShifaAbu made their first contribution in #6543
- @jagadish-amd made their first contribution in #6587
- @gyou2021 made their first contribution in #6551
Full Changelog: v0.15.1...v0.15.2