Releases · microsoft/DeepSpeed
v0.8.3: Patch release
What's Changed
- [deepspeed/autotuner] Bug fix for skipping mbs on gas by @rahilbathwal5 in #2171
- Fix issue between our abstract accelerator and colossalai's version of op_builder by @jeffra in #2963
- [zero] prevent poor configs from running w. zero-offload by @jeffra in #2971
- Fix Meta Tensor checkpoint load for OPT models by @lekurile in #2990
- ckpt: create directories in checkpoint_engine by @adammoody in #2988
- Fix buffer size for pipeline parallel and communication schedule by @tohtana in #2862
- [docs] add new paper to readme/docs by @jeffra in #3018
- fix language by @stas00 in #3019
- BF Optimizer Attribute Checks by @jomayeri in #3022
- [logger] implement `logger.warning_once` by @stas00 in #3021
- Convert model parameters from generator to list. by @jomayeri in #3017
- Improve loss overflow logs by @Quentin-Anthony in #3008
- Fix Broken Links by @satpalsr in #3048
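The `logger.warning_once` entry (#3021) adds a deduplicated warning helper. A minimal sketch of the idea — not DeepSpeed's actual implementation, and the logger name is hypothetical — using `functools.lru_cache` so each distinct message is logged only once:

```python
import logging
from functools import lru_cache

logger = logging.getLogger("deepspeed_example")  # hypothetical logger name

@lru_cache(maxsize=None)
def warning_once(message: str) -> None:
    # lru_cache memoizes by message, so the logger fires only on the
    # first occurrence of each distinct message; repeats are no-ops.
    logger.warning(message)
```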
Full Changelog: v0.8.2...v0.8.3
v0.8.2: Patch release
What's Changed
- add auto-generated PR workflow by @mrwyattii in #2822
- Fix typo in auto-sync workflow by @mrwyattii in #2850
- Fix example command for building wheel with dev version specified. by @loadams in #2815
- Create tensor parallelism blog/tutorial by @molly-smith in #2766
- Data efficiency library update by @conglongli in #2866
- Make z3 respect comm dtype by @tjruwase in #2807
- Automatic Tensor Parallelism Blog Links by @molly-smith in #2877
- Check device count before running dist tests by @HeyangQin in #2799
- AutoTP tutorial web formatting and news by @molly-smith in #2883
- Remove deprecated `torch._six` imports by @yasyf in #2863
- Reduce I/O size by @tjruwase in #2814
- add missing license info to top of all source code by @jeffra in #2889
- Enable tensor fragments for zero 2 & 3 by @tjruwase in #2727
- better eval sampler for val or test dataset by @mayank31398 in #2907
- using container when loading inference checkpoints by @HeyangQin in #2875
- Fix CPUAdam for when `vendor_id_raw` is not provided by @FarzanT in #2836
- Fix Bloom logits mismatch by @molly-smith in #2851
- Fixes `AttributeError` in #2853 by @saforem2 in #2854
- Add MPICH Multinode Runner by @inkcherry in #2839
- TP unsupported models and assertions by @molly-smith in #2810
- AutoTP Assert Kernel Injection Support by @molly-smith in #2939
- Check for local CUDA graphs when enable_cuda_graph=True by @lekurile in #2941
- Improve overflow handling by @tjruwase in #2944
- [RFC] add device abstraction to allow other device than CUDA be used by @delock in #2221
- deepspeed.init_distributed() support for TCP protocols by @noabauma in #2905
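The `torch._six` removal (#2863) matters for anyone tracking newer PyTorch, where that private compatibility module was deleted upstream. A migration sketch using stdlib equivalents for the commonly imported symbols; the `is_namedtuple` helper is hypothetical, added for illustration:

```python
# torch._six was a private compatibility shim removed in newer PyTorch
# releases; its commonly used symbols have stdlib equivalents.
from math import inf                      # replaces torch._six.inf
import collections.abc as container_abcs  # replaces torch._six.container_abcs

string_classes = (str,)                   # replaces torch._six.string_classes

def is_namedtuple(obj):
    # Hypothetical helper: namedtuple detection without torch._six.
    return isinstance(obj, tuple) and hasattr(obj, "_fields")
```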
New Contributors
- @HeyangQin made their first contribution in #2799
- @yasyf made their first contribution in #2863
- @mayank31398 made their first contribution in #2907
- @FarzanT made their first contribution in #2836
- @saforem2 made their first contribution in #2854
- @noabauma made their first contribution in #2905
Full Changelog: v0.8.1...v0.8.2
v0.8.1: Patch release
What's Changed
- CUDA optional deepspeed ops by @tjruwase in #2507
- Remove CI trigger for push to master by @mrwyattii in #2712
- [install] only add deepspeed pkg at install by @jeffra in #2714
- Fix nightly tests for new lm-eval release by @mrwyattii in #2713
- BF16 optimizer for BF16+ZeRO Stage 1 by @jomayeri in #2706
- Fix typo in diffusers transformer block by @mrwyattii in #2718
- Inference Refactor (replace_with_policy, model_implementations) by @awan-10 in #2554
- Change zero_grad() argument to match pytorch by @loadams in #2741
- Automatic tensor parallelism v2 by @molly-smith in #2670
- Fixing Optimizer Sanity Check by @jomayeri in #2742
- [GatheredParameters] fix memory leak by @stas00 in #2665
- Abstract accelerator (step 3) by @delock in #2677
- Fix autotuning so that it records Floating Point Operations per second, not microsecond by @dashstander in #2711
- fix a misspelled attribute by @stas00 in #2750
- [zero] remove misleading dtype log by @jeffra in #2732
- Fix softmax backward by @RezaYazdaniAminabadi in #2709
- Skip test_bias_gelu unit test if torch < 1.12 by @lekurile in #2754
- Conditionally Make Op Building More Verbose by @cmikeh2 in #2759
- Bing/formatting correction by @xiexbing in #2764
- Add links to new azureML examples by @cassieesvelt in #2756
- Fix hardcoded instances to fp16 in optimizer creation log messages to the correct dtype. by @loadams in #2743
- Refactor/Pydantify monitoring config by @mrwyattii in #2640
- Pin minimum `packaging` requirement by @carmocca in #2771
- Fix for diffusers v0.12.0 by @mrwyattii in #2753
- some fix in flops_profiler by @lucasleesw in #2068
- fix upsample flops compute by skipping unused kargs by @cli99 in #2773
- Fix broken kernel inject bug by @molly-smith in #2776
- Fix Checkpoint-loading with Meta-tensor by @RezaYazdaniAminabadi in #2781
- Add hjson support for user configs by @mrwyattii in #2783
- Reset KV-cache at the beginning of text-generation by @RezaYazdaniAminabadi in #2669
- Container param cleanup + remove qkv_merging by @lekurile in #2780
- Common location to install libaio-dev by @tjruwase in #2779
- Fixing broken link to azureml-examples recipes by @rtanase in #2795
- remove outdated comment by @stas00 in #2786
- Enable page-locked tensors without CUDA by @tjruwase in #2775
- Add container load checkpoint error reporting + refactor by @lekurile in #2792
- Add user defined launcher args for PDSH launcher by @loadams in #2804
- Fix Slurm launcher user args by @loadams in #2806
- Handle hanged tests in CI by @mrwyattii in #2808
- Fix inference CI device error by @mrwyattii in #2824
- Fix permissions issue with pip upgrade by @mrwyattii in #2823
- Fix cpu-only CI hangs by @mrwyattii in #2825
- Fix Pipeline Parallel resize unit test by @mrwyattii in #2833
- Fix auto TP for duplicate modules with different gems by @molly-smith in #2784
- Refactor DS inference API. No longer need replace_method. by @awan-10 in #2831
- Port Reza's INT8-quantization fix to container architecture by @lekurile in #2725
- Fix gpt-Neox rotary embedding implementation by @RezaYazdaniAminabadi in #2782
- Fix for CI failure on system upgrade by @mrwyattii in #2849
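The autotuning fix (#2711) is a unit-conversion bug: the timer reports microseconds, so dividing the op count by the raw value records FLOPs per microsecond rather than per second. An illustrative conversion helper, not the autotuner's actual code:

```python
def flops_per_second(num_ops: int, elapsed_us: float) -> float:
    # The profiler's timer reports microseconds; dividing the op count
    # by that raw value yields FLOPs per *microsecond*. Converting to
    # seconds first gives the conventional FLOPS figure.
    elapsed_s = elapsed_us * 1e-6
    return num_ops / elapsed_s
```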
New Contributors
- @loadams made their first contribution in #2741
- @xiexbing made their first contribution in #2764
- @carmocca made their first contribution in #2771
- @lucasleesw made their first contribution in #2068
- @rtanase made their first contribution in #2795
Full Changelog: v0.8.0...v0.8.1
DeepSpeed v0.8.0
New features
- DeepSpeed Data Efficiency: A composable library that makes better use of data, increases training efficiency, and improves model quality
- DeepSpeed Data Efficiency Library by @conglongli in #2585
What's Changed
- fix blog link by @conglongli in #2600
- Migrate ops tests to new inference_ops marker by @cmikeh2 in #2599
- Move layer norm to new schedule by @lokoppakmsft in #2590
- [deepspeed/autotuner] Bug fix for binary search for batch size by @rahilbathwal5 in #2162
- Fix for older versions of pydantic by @mrwyattii in #2611
- Use rocm/pytorch:latest for ROCm Dockerfile by @jithunnair-amd in #2613
- skip torch.zeros and tensor.copy_ when model parallel is not used by @guoyejun in #2479
- call empty_cache to really free up GPU memory as described in comment by @guoyejun in #2620
- Remove GatheredParameters context from replace_with_policy by @lekurile in #2591
- fixes #2498 by @clumsy in #2603
- Update AVX512 Detection by @cmikeh2 in #2621
- Add Megatron CI workflow by @mrwyattii in #2614
- [inference] check for unsupported model generate args by @jeffra in #2627
- [launcher] parse hostfile via regex and added error checks by @jeffra in #2626
- Unit tests setup own venv by @mrwyattii in #2628
- Fix #2409: add enable_each_rank_log to deepspeed/launcher/runner.py by @inkcherry in #2571
- Fix typo in autotuner.py by @eltociear in #2639
- [zero-3] Handle forward parameter return correctly in nested cases by @samyam in #2642
- [inference] ds-attention refactor w.r.t. ops by @jeffra in #2623
- Fix issue w. bloom int8 when changing tp size by @jeffra in #2645
- fix assertion error in zero stage 3 by @GuanhuaWang in #2647
- tweaks to ds-attn, distilbert policy, and mup by @jeffra in #2649
- [doc] fix `min_loss_scale` default by @stas00 in #2660
- [launcher] fail gracefully if hostname -i doesn't work as expected by @jeffra in #2631
- Fix Opt injection by @RezaYazdaniAminabadi in #2541
- Abstract accelerator (step 2) by @delock in #2560
- Remove unnecessary device synchronization for stage 2 by @li-yi-dong in #2500
- [Bug Fixed] torch.cuda.is_available -> torch.cuda.is_available() by @wkcn in #2661
- [fp16] lower `initial_scale_power` to 16 by @stas00 in #2663
- fix Tensor contiguous bug in model_compression by @xiaoxiawu-microsoft in #2671
- [inference] ds-mlp refactor w.r.t. ops by @jeffra in #2668
- real_accelerator validation check for both accelerator and deepspeed accelerator path by @delock in #2685
- fix typo and remove duplicated code in ZeRO stage 1 and 2 by @wkcn in #2655
- Add mlflow logging for aml by @cassieesvelt in #2495
- Fix import error of op_builder by @tohtana in #2687
- Pass training flag to forward call from module config by @lokoppakmsft in #2604
- Extend quantization utils features by @lokoppakmsft in #2683
- [GatheredParameters] add support for any iterable by @stas00 in #2664
- Fix for latest diffusers by @mrwyattii in #2699
- exclude benchmarks during install by @jeffra in #2698
- Correct loss scale in ZeRO step by @jomayeri in #2695
- [ZeRO] non-MoE stage 1 requires CG disabled by @jeffra in #2703
- remove print side effect from importing deepspeed by @jeffra in #2704
- ZeRO3 handling frozen weights by @tjruwase in #2653
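The `torch.cuda.is_available -> torch.cuda.is_available()` fix (#2661) corrects a classic Python pitfall: a bare function reference is always truthy, so the check passes even with no GPU. A stand-alone illustration — the stand-in function below is hypothetical, not DeepSpeed's code:

```python
def is_available():
    # Stand-in for torch.cuda.is_available on a machine without CUDA.
    return False

def has_gpu_buggy():
    # Bug pattern: missing parentheses, so this tests the function
    # object itself, which is always truthy.
    return bool(is_available)

def has_gpu_fixed():
    # The fix: actually call the function.
    return bool(is_available())
```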
New Contributors
- @eltociear made their first contribution in #2639
- @li-yi-dong made their first contribution in #2500
- @wkcn made their first contribution in #2661
- @xiaoxiawu-microsoft made their first contribution in #2671
- @cassieesvelt made their first contribution in #2495
- @tohtana made their first contribution in #2687
Full Changelog: v0.7.7...v0.8.0
v0.7.7: Patch release
What's Changed
- Update the locator for Megatron-LM by @rapsealk in #2564
- use get_global_rank if available by @jeffra in #2567
- Add Determined to open-source DL frameworks by @sirredbeard in #2573
- Support fp32 gradaccum for bf16 model by @delock in #2566
- Drop Maxwell Support by @cmikeh2 in #2574
- Fix quantized-inference & Add generic support of checkpoint loading by @RezaYazdaniAminabadi in #2547
- Fix MegatronLayerPolicy to have megatron_v2=True by @lekurile in #2579
- Update barrier and reduce_scatter_base to conform to PyTorch signatures by @Quentin-Anthony in #2570
- Support N-dimension input in quantization kernel by @lokoppakmsft in #2575
- Add checkpoint sharding unit tests by @mrwyattii in #2561
- Updating docs README by @jomayeri in #2587
- Updating API docs by @jomayeri in #2586
- Fix issues w. python 3.6 + add py-version checks to CI by @jeffra in #2589
- [benchmarks] get mask token from tokenizer by @jeffra in #2592
New Contributors
- @rapsealk made their first contribution in #2564
- @sirredbeard made their first contribution in #2573
Full Changelog: v0.7.6...v0.7.7
v0.7.6: Patch release
What's Changed
- DeepSpeed inference config. (#2459) by @awan-10 in #2472
- Update docs to autogenerate pydantic config model docs by @mrwyattii in #2509
- Add max_tokens alias to max_out_tokens arg to maintain backwards compatibility by @lekurile in #2508
- Deepspeed quantization library v0.1 by @lokoppakmsft in #2450
- Fix backward compatibility for InferenceConfig by @mrwyattii in #2516
- Add missing Inference sub-configs by @mrwyattii in #2518
- Add note about nvcc/hipcc requirement by @jeffra in #2519
- Update codeowners by @jeffra in #2525
- Dequantization Utils Library by @cmikeh2 in #2521
- Fixes for torch 1.14 due to new torch.numel return type by @jeffra in #2522
- Ensure MOE is initialized for SD by @cmikeh2 in #2534
- Make DS-Inference config readable from JSON by @mrwyattii in #2537
- Add MII tests by @mrwyattii in #2533
- Remove mutable default parameter in `init_inference()` by @aphedges in #2540
- Change Where DS/Triton is Used in Stable Diffusion by @cmikeh2 in #2536
- Pass down the new DS inference config to replace_transformer_layer. by @awan-10 in #2539
- Adding Gradient Accumulation Data Type Config by @jomayeri in #2512
- Report progress at gradient accumulation boundary by @ShijieZZZZ in #2553
- encoded ds config into command line argument when launching child processes in autotuning by @cli99 in #2524
- Add missing MoE fields to inference config for backward compatibility by @mrwyattii in #2556
- Abstract accelerator (step 1) by @delock in #2504
- Fix invalid check of recorded parameter orders in zero stage3. by @inkcherry in #2550
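The `init_inference()` fix (#2540) removes a mutable default argument, another well-known Python pitfall: a default `{}` or `[]` is created once at function definition and shared across all calls. A sketch of the safe pattern; the `configure` helper is hypothetical, not DeepSpeed's API:

```python
def configure(overrides=None):
    # Safe pattern: default to None and build the dict per call.
    # A literal `overrides={}` default is created once at function
    # definition time, so mutations would leak between unrelated calls.
    if overrides is None:
        overrides = {}
    overrides.setdefault("dtype", "fp16")
    return overrides
```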
New Contributors
- @ShijieZZZZ made their first contribution in #2553
- @delock made their first contribution in #2504
- @inkcherry made their first contribution in #2550
Full Changelog: v0.7.5...v0.7.6
v0.7.5: Patch release
What's Changed
- Fix Bug #2319 by @jomayeri in #2438
- update pytorch pool operator function signature by @cli99 in #2443
- Fix build issues on Windows by @eltonzheng in #2428
- rollback ds config changes by @cli99 in #2395
- Use CUDA events for inference model profiling by @mrwyattii in #2371
- Fixing a config mismatch in unit test. by @jomayeri in #2447
- Reduction Kernel Utility by @cmikeh2 in #2436
- deepspeed/launcher/launch.py: add option enable_each_rank_log by @guoyejun in #2409
- Fixes for various CI problems by @mrwyattii in #2457
- Cache Allocation and Softmax Fixes by @cmikeh2 in #2433
- Fix checkpoint loading at inference-engine by @RezaYazdaniAminabadi in #2429
- Create a new folder structure to isolate model-specific code in DS by @awan-10 in #2464
- don't gather partitioned activations for mp size 1 by @guoyejun in #2454
- Updating autotune json default in docs. by @jomayeri in #2476
- Added MLFLOW environment variables for logging metrics within trainig… by @savitamittal1 in #2477
- fix accelerate link in README by @kyoto7250 in #2481
- Fix Stable-Diffusion: Add correct memory-allocation at DeepSpeed-Attention by @RezaYazdaniAminabadi in #2474
- Fix CI issues related to cupy install by @mrwyattii in #2483
- Add `scale_attn_by_inverse_layer_idx` feature by @hyunwoongko in #2486
- Stable Diffusion Enhancements by @cmikeh2 in #2491
- stage_1_and_2.py: no allreduce needed when mp size is 1 by @guoyejun in #2494
- Make bf16_optimizer work for non pipeline parallelism by @tjruwase in #2470
- Fix nightly CI tests by @mrwyattii in #2493
- Make data contiguous before the inplace reshape-copy_ function. by @lokoppakmsft in #2489
- Fix typos: deepseed -> deepspeed by @jinyouzhi in #2499
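The `scale_attn_by_inverse_layer_idx` entry (#2486) mirrors the Hugging Face GPT-2 config option of the same name, which divides attention logits by `layer_idx + 1` on top of the usual `1/sqrt(head_dim)` scaling. A scalar sketch of that scaling, for illustration only:

```python
import math

def scaled_attn_logit(q_dot_k: float, head_dim: int, layer_idx: int,
                      scale_attn_by_inverse_layer_idx: bool = True) -> float:
    # Standard scaled dot-product attention divides by sqrt(head_dim);
    # the GPT-2-style flag additionally divides by (layer_idx + 1),
    # damping logits more strongly in deeper layers.
    logit = q_dot_k / math.sqrt(head_dim)
    if scale_attn_by_inverse_layer_idx:
        logit = logit / float(layer_idx + 1)
    return logit
```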
New Contributors
- @guoyejun made their first contribution in #2409
- @savitamittal1 made their first contribution in #2477
- @kyoto7250 made their first contribution in #2481
- @lokoppakmsft made their first contribution in #2489
- @jinyouzhi made their first contribution in #2499
Full Changelog: v0.7.4...v0.7.5
v0.7.4: Patch release
What's Changed
- MOE residual matmult unit test by @samadejacobs in #2323
- MOE matmult with memaccess by @samadejacobs in #2336
- Refactor residual add kernels by @arashb in #2333
- mem access for quantize kernel by @GuanhuaWang in #2331
- increase min pre-commit versions by @jeffra in #2346
- Extend scratch buffer for long prompts by @cmikeh2 in #2212
- [docs] fix zero docs by @jeffra in #2350
- Staging profile inference v1 (#2348) by @awan-10 in #2349
- Kernel Data Conversion Utility by @cmikeh2 in #2327
- Add Onebit Optimizers in init by @l4d2boomer in #2340
- docs(mixture-of-experts-inference): fix typo in tuto by @jqueguiner in #2345
- Use blob storage for datasets in unit tests by @mrwyattii in #2342
- Refactor `gptj_residual_add` kernels for better readability by @arashb in #2358
- Updated issue templates by @jeffra in #2363
- fix cuda invalid config error in dequant kernel by @GuanhuaWang in #2362
- Add missing pytest fixture scope by @arashb in #2353
- Extend residual_add kernel tests to cover pre_attn_norm by @arashb in #2354
- Refactor `fused_bias_residual` kernels for better readability by @arashb in #2356
- Capture error message during sweep tests by @molly-smith in #2351
- Fix an exception when auto-casting dicts to fp16 by @mjksmith in #2370
- Refactor remaining distributed tests by @mrwyattii in #2216
- Fix the MLP output tensor's shape by @arashb in #2380
- add 11.8 to cuda_minor_mismatch_ok to allow building with current CUDA by @Thomas-MMJ in #2390
- Pin Transformers test version by @mrwyattii in #2402
- Change type to tuple in replace_wo_policy isinstance check by @lekurile in #2387
- Checkpoint backwards-compatibility workaround by @tjruwase in #2384
- Add Predicated Global Load to Memory Access Utils by @cmikeh2 in #2373
- MII blog post by @jeffra in #2418
- Fix figure reference by @awan-10 in #2419
- Add SLURM Multinode Runner by @dashstander in #2404
- Fix issue with corrupted output on long generation for GPT by @andrewchernyh in #2359
- Fix GPT Neo-X multi-gpu inference by @andrewchernyh in #2401
- CI fixes related to triton by @jeffra in #2422
- [docs] update mii blog title by @jeffra in #2423
- add SD injection policy by @jeffra in #2381
- Fix checkpoint loading when it is a dictionary by @RezaYazdaniAminabadi in #2425
- Make error regex more generic in collect_results.py by @molly-smith in #2415
- fixes #2389 by @clumsy in #2411
- Fix for inference gpt-j test by @mrwyattii in #2430
- Fixing bug 2361 by @jomayeri in #2410
- Universal checkpoint for zero stage 1 by @tjruwase in #2284
- only add deps if extra is explicitly called by @jeffra in #2432
- Add TestInjectionPolicy inference unittest class for testing custom injection policies by @lekurile in #2426
- [memory estimators] new config args sync by @stas00 in #2431
- parallelize writing of layer checkpoint files across data parallel instances by @adammoody in #1419
- Fix broken link to DeepSpeed Megatron fork by @lekurile in #2440
New Contributors
- @l4d2boomer made their first contribution in #2340
- @jqueguiner made their first contribution in #2345
- @mjksmith made their first contribution in #2370
- @Thomas-MMJ made their first contribution in #2390
- @lekurile made their first contribution in #2387
- @dashstander made their first contribution in #2404
- @andrewchernyh made their first contribution in #2359
- @clumsy made their first contribution in #2411
- @jomayeri made their first contribution in #2410
Full Changelog: v0.7.3...v0.7.4
v0.7.3: Patch release
What's Changed
- Add blob storage to CI runners by @mrwyattii in #2260
- Update replace_module.py, test-gptj.py related fix by @molly-smith in #2269
- Fix OrderedDict import for python3.6 by @Dipet in #2267
- Ds inference/fix mp2 by @RezaYazdaniAminabadi in #2270
- Trajepl: nebula load fix by @trajepl in #2182
- Prevent torch ext folder mkdir at tmp by @jeffra in #2274
- Ds-inference Int8 support through ZeroQuant technology by @RezaYazdaniAminabadi in #2217
- add a new unit test for cuda ops by @awan-10 in #2278
- Addition to code owners file by @cmikeh2 in #2279
- Memory Access Utility by @cmikeh2 in #2276
- Fp32 accuracy bug fix by @RezaYazdaniAminabadi in #2285
- Refactor universal checkpointing and tensor fragments by @tjruwase in #2253
- [ds-inference] fix progress bar by @stas00 in #2286
- Offload all gradients to nvme by @tjruwase in #2282
- fused bias relu unittest by @molly-smith in #2297
- Fix for pytest picking up wrong deepspeed by @mrwyattii in #2299
- Fix for Zero3 when MP>1 by @Quentin-Anthony in #2289
- Unit test for bias add kernel by @mrwyattii in #2298
- Update relu.cu with mem_access_utils by @molly-smith in #2306
- Add tensor parallel inference unit tests by @mrwyattii in #2232
- Fix the residual add mp scaling for GPTNeoX by @arashb in #2310
- Add unit tests for residual_add kernel by @arashb in #2307
- add inference eval scripts by @jeffra in #2303
- Upgrade P40 tests to torch 1.8 by @mrwyattii in #2316
- ZeRO-Inference blog by @tjruwase in #2271
- ZeRO-Inference blog - wrap up by @tjruwase in #2321
- ZeRO-Inference blog - Update README by @tjruwase in #2322
- Refactor relu bias add with mem_access utils by @mrwyattii in #2317
- add quant unit test by @GuanhuaWang in #2315
- only override forward if using cuda-graph by @jeffra in #2291
- Add more options to inference benchmark by @mrwyattii in #2325
New Contributors
- @molly-smith made their first contribution in #2269
Full Changelog: v0.7.2...v0.7.3
v0.7.2: Patch release
What's Changed
- Enable contiguous gradients with Z1+MoE by @siddharth9820 in #2250
- Correctly detect CPU optimizer usage by @tjruwase in #2257
- Update Half Precision Kernel Compatibility by @cmikeh2 in #2261
- fix #2240: wrong time unit in flops_profiler by @yzs981130 in #2241
New Contributors
- @cmikeh2 made their first contribution in #2261
- @yzs981130 made their first contribution in #2241
Full Changelog: v0.7.1...v0.7.2