DeepSpeed 0.14.0 with CUDA 12.1 - Installation Instructions:
1. Download the 0.14.0 release of DeepSpeed and extract it to a folder.
2. Install Visual C++ build tools, such as the VS2019 C++ x64/x86 build tools.
3. Download and install the NVIDIA CUDA Toolkit 12.1.
4. Edit your Windows environment variables to ensure that CUDA_HOME and CUDA_PATH are set to your NVIDIA CUDA Toolkit path (the folder above the bin folder that nvcc.exe is installed in). For example:

        set CUDA_HOME=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1
        set CUDA_PATH=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1
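   You can optionally sanity-check these before building; a minimal check, assuming nvcc.exe sits in the toolkit's bin folder as described above:

        REM optional: confirm the variables resolve to a working toolkit
        echo %CUDA_HOME%
        "%CUDA_HOME%\bin\nvcc.exe" --version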
5. OPTIONAL: If you do not have a Python environment already created, you can install Miniconda, then at a command prompt create and activate your environment with:

        conda create -n pythonenv python=3.11
        activate pythonenv

   (On newer Conda versions, use conda activate pythonenv.)
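   A trivial way to confirm the environment is active and on the expected interpreter:

        REM optional: should print Python 3.11.x
        python --version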
6. Launch the Command Prompt (cmd) with Administrator privileges, as admin rights are required to create symlink folders during the build.
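   If you are unsure whether the prompt really has symlink rights, one quick test is to create and delete a throwaway directory symlink (C:\ds_symlink_test is just a placeholder name); if mklink fails with a privilege error, re-launch the prompt as Administrator:

        REM optional: test symlink-creation rights, then clean up
        mklink /D C:\ds_symlink_test C:\Windows
        rmdir C:\ds_symlink_test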
7. Install PyTorch 2.2.1 with CUDA 12.1 into your Python 3.11 environment. Activate your Python environment first, e.g.:

        activate pythonenv
        conda install pytorch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1 pytorch-cuda=12.1 -c pytorch -c nvidia
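   A quick optional check that the install picked up CUDA:

        REM optional: verify PyTorch sees CUDA
        python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"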
8. In your Python environment, check that CUDA_HOME and CUDA_PATH still point to the correct location:

        set

   (This lists the Windows environment variables so you can check them. Refer to step 4 if they are wrong.)
9. Navigate to your DeepSpeed folder in the Command Prompt (wherever you extracted it to):

        cd c:\deepspeed
10. Modify the following files:

    deepspeed-0.14.0/build_win.bat - at the top of the file, add:

        set DS_BUILD_EVOFORMER_ATTN=0
        set DS_BUILD_CUTLASS_OPS=0
        set DS_BUILD_RAGGED_DEVICE_OPS=0
        set DS_BUILD_INFERENCE_CORE_OPS=0

    (These DS_BUILD_* flags skip building ops that do not currently compile on Windows.)
    deepspeed-0.14.0/csrc/quantization/pt_binding.cpp - lines 244-250 - change to:

        std::vector<int64_t> sz_vector(input_vals.sizes().begin(), input_vals.sizes().end());
        sz_vector[sz_vector.size() - 1] = sz_vector.back() / devices_per_node;  // num GPUs per node
        at::IntArrayRef sz(sz_vector);
        auto output = torch::empty(sz, output_options);
        const int elems_per_in_tensor = at::numel(input_vals) / devices_per_node;
        const int elems_per_in_group = elems_per_in_tensor / (in_groups / devices_per_node);
        const int elems_per_out_group = elems_per_in_tensor / out_groups;
    deepspeed-0.14.0/csrc/transformer/inference/csrc/pt_binding.cpp - lines 541-542 - change to:

        {static_cast<unsigned>(hidden_dim * InferenceContext::Instance().GetMaxTokenLength()),
         static_cast<unsigned>(k * InferenceContext::Instance().GetMaxTokenLength()),

    lines 550-551 - change to:

        {static_cast<unsigned>(hidden_dim * InferenceContext::Instance().GetMaxTokenLength()),
         static_cast<unsigned>(k * InferenceContext::Instance().GetMaxTokenLength()),

    line 1581 - change to:

        at::from_blob(intermediate_ptr, {input.size(0), input.size(1), static_cast<int64_t>(mlp_1_out_neurons)}, options);
    deepspeed-0.14.0/deepspeed/env_report.py - line 10 - add:

        import psutil

    lines 83-100 - change to:

        def get_shm_size():
            try:
                temp_dir = os.getenv('TEMP') or os.getenv('TMP') or os.path.join(os.path.expanduser('~'), 'tmp')
                shm_stats = psutil.disk_usage(temp_dir)
                shm_size = shm_stats.total
                shm_hbytes = human_readable_size(shm_size)
                warn = []
                if shm_size < 512 * 1024**2:
                    warn.append(
                        f" {YELLOW} [WARNING] Shared memory size might be too small, consider increasing it. {END}"
                    )
                # Add additional warnings specific to your use case if needed.
                return shm_hbytes, warn
            except Exception as e:
                return "UNKNOWN", [f"Error getting shared memory size: {e}"]
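    To preview what the patched helper will report on your machine, you can query psutil directly (an optional check, assuming psutil is installed in your environment):

        REM optional: print total/used/free for the temp dir the patch inspects
        python -c "import os, psutil; print(psutil.disk_usage(os.getenv('TEMP') or os.getenv('TMP')))"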
11. While still in your command line with the Python environment activated, run:

        build_win.bat
12. Once the build finishes, a .whl file should be present in:

        deepspeed-0.14.0/dist/
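    To find the exact wheel filename for the next step, you can list that folder (path relative to wherever you ran the build):

        REM optional: list the build output
        dir deepspeed-0.14.0\dist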
13. Copy that file to the root of your Oobabooga folder and run:

        cmd_windows.bat
        pip install deepspeed-YOURFILENAME.whl

    (Substitute the name of the .whl file you just created.)
14. To check that it is working correctly, type the following:

        set CUDA_HOME=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1
        set CUDA_PATH=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1
        ds_report

    (Setting the variables here is only needed to make ds_report work and to check that DeepSpeed installed correctly; it should not be needed for TTS generation.)
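    As a final optional check, confirm the package imports and prints its version (assumes your Python environment is still active):

        REM optional: confirm DeepSpeed imports cleanly
        python -c "import deepspeed; print(deepspeed.__version__)"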