DeepSpeed 0.14.0 with CUDA 12.1 - Installation Instructions:
1. Download the 0.14.0 release of DeepSpeed and extract it to a folder.
2. Install Visual C++ build tools, such as the VS2019 C++ x64/x86 build tools.
3. Download and install the NVIDIA CUDA Toolkit 12.1.
4. Edit your Windows environment variables to ensure that CUDA_HOME and CUDA_PATH are set to your NVIDIA CUDA Toolkit path (the folder above the bin folder that nvcc.exe is installed in). For example:

        set CUDA_HOME=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1
        set CUDA_PATH=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1
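   You can optionally sanity-check these before building; a minimal check, assuming nvcc.exe sits in the toolkit's bin folder as described above:

        REM optional: confirm the variables resolve to a working toolkit
        echo %CUDA_HOME%
        "%CUDA_HOME%\bin\nvcc.exe" --version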
5. OPTIONAL: If you do not have a Python environment already created, you can install Miniconda, then at a command prompt create and activate your environment with:

        conda create -n pythonenv python=3.11
        activate pythonenv

   (On newer Conda versions, use conda activate pythonenv.)
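   A trivial way to confirm the environment is active and on the expected interpreter:

        REM optional: should print Python 3.11.x
        python --version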
6. Launch the Command Prompt (cmd) with Administrator privileges, as admin rights are required to create symlink folders during the build.
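   If you are unsure whether the prompt really has symlink rights, one quick test is to create and delete a throwaway directory symlink (C:\ds_symlink_test is just a placeholder name); if mklink fails with a privilege error, re-launch the prompt as Administrator:

        REM optional: test symlink-creation rights, then clean up
        mklink /D C:\ds_symlink_test C:\Windows
        rmdir C:\ds_symlink_test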
7. Install PyTorch 2.2.1 with CUDA 12.1 into your Python 3.11 environment. Activate your Python environment first, e.g.:

        activate pythonenv
        conda install pytorch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1 pytorch-cuda=12.1 -c pytorch -c nvidia
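   A quick optional check that the install picked up CUDA:

        REM optional: verify PyTorch sees CUDA
        python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"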
8. In your Python environment, check that CUDA_HOME and CUDA_PATH still point to the correct location:

        set

   (This lists the Windows environment variables so you can check them. Refer to step 4 if they are wrong.)
9. Navigate to your DeepSpeed folder in the Command Prompt (wherever you extracted it to):

        cd c:\deepspeed
10. Modify the following files:

    deepspeed-0.14.0/build_win.bat - at the top of the file, add:

        set DS_BUILD_EVOFORMER_ATTN=0
        set DS_BUILD_CUTLASS_OPS=0
        set DS_BUILD_RAGGED_DEVICE_OPS=0
        set DS_BUILD_INFERENCE_CORE_OPS=0

    (These DS_BUILD_* flags skip building ops that do not currently compile on Windows.)
    deepspeed-0.14.0/csrc/quantization/pt_binding.cpp - lines 244-250 - change to:

        std::vector<int64_t> sz_vector(input_vals.sizes().begin(), input_vals.sizes().end());
        sz_vector[sz_vector.size() - 1] = sz_vector.back() / devices_per_node;  // num GPUs per node
        at::IntArrayRef sz(sz_vector);
        auto output = torch::empty(sz, output_options);
        const int elems_per_in_tensor = at::numel(input_vals) / devices_per_node;
        const int elems_per_in_group = elems_per_in_tensor / (in_groups / devices_per_node);
        const int elems_per_out_group = elems_per_in_tensor / out_groups;
    deepspeed-0.14.0/csrc/transformer/inference/csrc/pt_binding.cpp - lines 541-542 - change to:

        {static_cast<unsigned>(hidden_dim * InferenceContext::Instance().GetMaxTokenLength()),
         static_cast<unsigned>(k * InferenceContext::Instance().GetMaxTokenLength()),

    lines 550-551 - change to:

        {static_cast<unsigned>(hidden_dim * InferenceContext::Instance().GetMaxTokenLength()),
         static_cast<unsigned>(k * InferenceContext::Instance().GetMaxTokenLength()),

    line 1581 - change to:

        at::from_blob(intermediate_ptr, {input.size(0), input.size(1), static_cast<int64_t>(mlp_1_out_neurons)}, options);
    deepspeed-0.14.0/deepspeed/env_report.py - line 10 - add:

        import psutil

    lines 83-100 - change to:

        def get_shm_size():
            try:
                temp_dir = os.getenv('TEMP') or os.getenv('TMP') or os.path.join(os.path.expanduser('~'), 'tmp')
                shm_stats = psutil.disk_usage(temp_dir)
                shm_size = shm_stats.total
                shm_hbytes = human_readable_size(shm_size)
                warn = []
                if shm_size < 512 * 1024**2:
                    warn.append(
                        f" {YELLOW} [WARNING] Shared memory size might be too small, consider increasing it. {END}"
                    )
                # Add additional warnings specific to your use case if needed.
                return shm_hbytes, warn
            except Exception as e:
                return "UNKNOWN", [f"Error getting shared memory size: {e}"]
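    To preview what the patched helper will report on your machine, you can query psutil directly (an optional check, assuming psutil is installed in your environment):

        REM optional: print total/used/free for the temp dir the patch inspects
        python -c "import os, psutil; print(psutil.disk_usage(os.getenv('TEMP') or os.getenv('TMP')))"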
11. While still in your command line with the Python environment activated, run:

        build_win.bat
12. Once the build finishes, a .whl file should be present in:

        deepspeed-0.14.0/dist/
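    To find the exact wheel filename for the next step, you can list that folder (path relative to wherever you ran the build):

        REM optional: list the build output
        dir deepspeed-0.14.0\dist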
13. Copy that file to the root of your Oobabooga folder and run:

        cmd_windows.bat
        pip install deepspeed-YOURFILENAME.whl

    (Substitute the name of the .whl file you just created.)
14. To check that it is working correctly, type the following:

        set CUDA_HOME=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1
        set CUDA_PATH=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1
        ds_report

    (Setting the variables here is only needed to make ds_report work and to check that DeepSpeed installed correctly; it should not be needed for TTS generation.)
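    As a final optional check, confirm the package imports and prints its version (assumes your Python environment is still active):

        REM optional: confirm DeepSpeed imports cleanly
        python -c "import deepspeed; print(deepspeed.__version__)"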