Problem compiling deepmd 3.1.2 from cuda 12.9 #5195
1 comment · 6 replies
Your error is most likely caused by a mismatch or incompatibility between your CUDA/cuBLAS, PyTorch, and deepmd-kit builds, or by exceeding GPU parameter limits.

- First, confirm your PyTorch version is exactly 2.10.0. DeepMD-kit 3.1.2 recently updated its requirements to strictly require PyTorch 2.10.0 for CUDA 12.9 compatibility; other versions may not work reliably and can cause runtime errors like the one you are seeing (see PR).
- If your PyTorch version is correct, try reducing your batch size and the selection parameters.
- If the error persists, rebuild deepmd-kit after ensuring ABI compatibility with your installed PyTorch. If you built deepmd-kit with a different CXX11_ABI flag than your PyTorch runtime, you can get subtle or cryptic CUDA errors at runtime. Always build deepmd-kit in the same environment as your PyTorch install.
- If you want a more robust setup, consider installing deepmd-kit via conda or the official shell script. These methods manage CUDA/cuBLAS/PyTorch dependencies more reliably than pip-from-source. Both conda environments and shell-script installs can be managed as lmod modules: just create a modulefile that sets the appropriate environment variables.

Let me know your PyTorch version and whether these steps help! If not, I can suggest further debugging steps.

To reply, just mention @dosu.
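A quick way to check the first and third points is the snippet below. It only reports what the installed PyTorch wheel was built with (`torch.compiled_with_cxx11_abi()` and `torch.version.cuda` are real PyTorch APIs); it does not inspect your deepmd-kit build, so compare the ABI flag against the one used when you compiled deepmd-kit.

```shell
# Print the installed PyTorch version, the CUDA version it was built against,
# and whether it was compiled with the C++11 ABI.
status=$(python3 - <<'EOF'
try:
    import torch
    print(f"torch {torch.__version__}, cuda {torch.version.cuda}, "
          f"cxx11_abi {torch.compiled_with_cxx11_abi()}")
except ImportError:
    print("torch is not installed in this environment")
EOF
)
echo "$status"
```

If the reported ABI flag differs from the `_GLIBCXX_USE_CXX11_ABI` value of your deepmd-kit build, that alone can explain cryptic CUDA/cuBLAS errors at runtime.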
Dear community,
Recently I have been trying to install deepmd-kit 3.1.2 from source using CUDA 12.9. The compilation seems to pass, but when running a test fine-tuning of the DPA-3 force field, I encountered the following error:
[2026-02-06 10:36:22,531] DEEPMD INFO val/mp-2412-GGA 24 6 2 1.504e-03 T
[2026-02-06 10:36:22,531] DEEPMD INFO val/mp-555915-GGA 72 2 3 2.256e-03 T
[2026-02-06 10:36:22,531] DEEPMD INFO val/mp-674158-GGA 1 128 1 7.519e-04 T
[2026-02-06 10:36:22,531] DEEPMD INFO val/mp-83-GGA 56 3 3 2.256e-03 T
[2026-02-06 10:36:22,531] DEEPMD INFO --------------------------------------------------------------------------------------
[2026-02-06 10:36:23,191] DEEPMD INFO Start to train 100 steps.
Traceback (most recent call last):
File "/app/softwares/deepmd/3.1.2-cu129/venv/bin/dp", line 6, in <module>
sys.exit(main())
File "/app/softwares/deepmd/3.1.2-cu129/venv/lib/python3.10/site-packages/deepmd/main.py", line 1020, in main
deepmd_main(args)
File "/app/softwares/deepmd/3.1.2-cu129/venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 362, in wrapper
return f(*args, **kwargs)
File "/app/softwares/deepmd/3.1.2-cu129/venv/lib/python3.10/site-packages/deepmd/pt/entrypoints/main.py", line 536, in main
train(
File "/app/softwares/deepmd/3.1.2-cu129/venv/lib/python3.10/site-packages/deepmd/pt/entrypoints/main.py", line 367, in train
trainer.run()
File "/app/softwares/deepmd/3.1.2-cu129/venv/lib/python3.10/site-packages/deepmd/pt/train/training.py", line 1132, in run
step(step_id)
File "/app/softwares/deepmd/3.1.2-cu129/venv/lib/python3.10/site-packages/deepmd/pt/train/training.py", line 772, in step
model_pred, loss, more_loss = self.wrapper(
File "/app/softwares/deepmd/3.1.2-cu129/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/app/softwares/deepmd/3.1.2-cu129/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1787, in _call_impl
return forward_call(*args, **kwargs)
File "/app/softwares/deepmd/3.1.2-cu129/venv/lib/python3.10/site-packages/deepmd/pt/train/wrapper.py", line 193, in forward
model_pred, loss, more_loss = self.loss[task_key](
File "/app/softwares/deepmd/3.1.2-cu129/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/app/softwares/deepmd/3.1.2-cu129/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1787, in _call_impl
return forward_call(*args, **kwargs)
File "/app/softwares/deepmd/3.1.2-cu129/venv/lib/python3.10/site-packages/deepmd/pt/loss/ener.py", line 186, in forward
model_pred = model(**input_dict)
File "/app/softwares/deepmd/3.1.2-cu129/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/app/softwares/deepmd/3.1.2-cu129/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1787, in _call_impl
return forward_call(*args, **kwargs)
File "/app/softwares/deepmd/3.1.2-cu129/venv/lib/python3.10/site-packages/deepmd/pt/model/model/ener_model.py", line 103, in forward
model_ret = self.forward_common(
File "/app/softwares/deepmd/3.1.2-cu129/venv/lib/python3.10/site-packages/deepmd/pt/model/model/make_model.py", line 175, in forward_common
) = extend_input_and_build_neighbor_list(
File "/app/softwares/deepmd/3.1.2-cu129/venv/lib/python3.10/site-packages/deepmd/pt/utils/nlist.py", line 29, in extend_input_and_build_neighbor_list
coord_normalized = normalize_coord(
File "/app/softwares/deepmd/3.1.2-cu129/venv/lib/python3.10/site-packages/deepmd/pt/utils/region.py", line 103, in normalize_coord
icoord = phys2inter(coord, cell)
File "/app/softwares/deepmd/3.1.2-cu129/venv/lib/python3.10/site-packages/deepmd/pt/utils/region.py", line 25, in phys2inter
return torch.matmul(coord, rec_cell)
RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling
cublasDgemmStridedBatched( handle, opa, opb, m, n, k, &alpha, a, lda, stridea, b, ldb, strideb, &beta, c, ldc, stridec, num_batches)

The training setup I was using:
```json
{
  "_comment": "that's all",
  "model": {
    "type_map": [ "H", "He", "Li", "Be", "B", "C", "N", "O", "F", "Ne", "Na", "Mg", "Al", "Si", "P", "S", "Cl", "Ar", "K", "Ca", "Sc", "Ti", "V", "Cr", "Mn", "Fe", "Co", "Ni", "Cu", "Zn", "Ga", "Ge", "As", "Se", "Br", "Kr", "Rb", "Sr", "Y", "Zr", "Nb", "Mo", "Tc", "Ru", "Rh", "Pd", "Ag", "Cd", "In", "Sn", "Sb", "Te", "I", "Xe", "Cs", "Ba", "La", "Ce", "Pr", "Nd", "Pm", "Sm", "Eu", "Gd", "Tb", "Dy", "Ho", "Er", "Tm", "Yb", "Lu", "Hf", "Ta", "W", "Re", "Os", "Ir", "Pt", "Au", "Hg", "Tl", "Pb", "Bi", "Po", "At", "Rn", "Fr", "Ra", "Ac", "Th", "Pa", "U", "Np", "Pu", "Am", "Cm", "Bk", "Cf", "Es", "Fm", "Md", "No", "Lr", "Rf", "Db", "Sg", "Bh", "Hs", "Mt", "Ds", "Rg", "Cn", "Nh", "Fl", "Mc", "Lv", "Ts", "Og" ],
    "descriptor": {
      "type": "dpa3",
      "repflow": {
        "n_dim": 128, "e_dim": 64, "a_dim": 32, "nlayers": 16,
        "e_rcut": 6.0, "e_rcut_smth": 5.3, "e_sel": 1200,
        "a_rcut": 4.0, "a_rcut_smth": 3.5, "a_sel": 300,
        "axis_neuron": 4, "fix_stat_std": 0.3,
        "a_compress_rate": 1, "a_compress_e_rate": 2, "a_compress_use_split": true,
        "update_angle": true, "smooth_edge_update": true,
        "use_dynamic_sel": true, "sel_reduce_factor": 10.0, "use_exp_switch": true,
        "update_style": "res_residual", "update_residual": 0.1, "update_residual_init": "const"
      },
      "activation_function": "silut:3.0",
      "use_tebd_bias": false,
      "precision": "float32",
      "concat_output_tebd": false
    },
    "fitting_net": {
      "neuron": [ 240, 240, 240 ],
      "dim_case_embd": 31,
      "resnet_dt": true,
      "precision": "float32",
      "activation_function": "silut:3.0",
      "seed": 1,
      "_comment": " that's all"
    },
    "_comment": " that's all"
  },
  "learning_rate": {
    "type": "exp",
    "decay_steps": 20,
    "start_lr": 0.001,
    "stop_lr": 3e-05,
    "_comment": "that's all"
  },
  "loss": {
    "type": "ener",
    "start_pref_e": 0.2, "limit_pref_e": 20,
    "start_pref_f": 100, "limit_pref_f": 60,
    "start_pref_v": 0.02, "limit_pref_v": 1,
    "_comment": " that's all"
  },
  "training": {
    "stat_file": "./dpa3.hdf5",
    "training_data": { "systems": [ ... ], "batch_size": "auto:128", "_comment": "that's all" },
    "validation_data": { "systems": [ ... ], "batch_size": "auto:128", "_comment": "that's all" },
    "numb_steps": 100, "warmup_steps": 0, "gradient_max_norm": 5.0, "seed": 10,
    "disp_file": "lcurve.out", "disp_freq": 100, "save_freq": 2000,
    "_comment": "that's all"
  }
}
```

My GPU: nvidia V100 32G PCIE
My nvidia driver: system-wide nvidia-driver-580-server
My nvcc: from lmod module cuda/12.9, installed from nvidia's official cuda_12.9.0_575.51.03_linux.run
My gcc: system-wide gcc (Ubuntu 11.4.0-1ubuntu1~22.04.2) 11.4.0
My torch: installed using pip, as mentioned in your official documentation: https://docs.deepmodeling.com/projects/deepmd/en/stable/install/install-from-source.html
My deepmd-kit source code was not obtained via git clone; instead I used wget to download the tarball from your 3.1.2 github release link.
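The reply above suggests reducing the batch size and the selection parameters; a minimal stdlib sketch of that tweak is below. The keys (`e_sel`, `a_sel`, `batch_size`) mirror my config; halving them is illustrative only, not a recommended value.

```python
import json

# A stripped-down stand-in for the real input.json; only the keys being
# tweaked are included here.
config_text = """
{
  "model": {"descriptor": {"repflow": {"e_sel": 1200, "a_sel": 300}}},
  "training": {"training_data": {"batch_size": "auto:128"}}
}
"""
cfg = json.loads(config_text)

# Halve the neighbor-selection sizes as a first debugging step.
repflow = cfg["model"]["descriptor"]["repflow"]
repflow["e_sel"] //= 2   # 1200 -> 600
repflow["a_sel"] //= 2   # 300 -> 150

# Lower the automatic batch-size cap as well.
cfg["training"]["training_data"]["batch_size"] = "auto:64"

print(json.dumps(cfg, indent=2))
```

If the error disappears with smaller values, that points at GPU resource limits rather than an ABI problem.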
All of my installation commands:
```bash
DP_VARIANT=cuda CUDAToolkit_ROOT=/app/softwares/cuda/12.9/ DP_ENABLE_TENSORFLOW=1 DP_ENABLE_PYTORCH=1 pip install .
```
Could you please investigate and advise how I might fix this issue?
Besides, if this issue is unlikely to be fixed under my current setup, I may have to try installing from the shell scripts. However, even with a shell-script or conda install, would it still be possible to manage the whole deepmd-kit environment as an lmod module (i.e., loaded via `module load deepmd/3.1.2` and cleanly unloaded via `module unload deepmd/3.1.2`)?
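For context, this is the kind of lmod modulefile I have in mind; all paths are hypothetical, based on my current install prefix, and would need to point at wherever the conda or shell-script installer puts the environment:

```lua
-- Hypothetical modulefile, e.g. /app/modulefiles/deepmd/3.1.2.lua
-- (paths assume an install under /app/softwares/deepmd/3.1.2-cu129).
help([[deepmd-kit 3.1.2 built against CUDA 12.9]])
local root = "/app/softwares/deepmd/3.1.2-cu129"
prepend_path("PATH", pathJoin(root, "venv/bin"))
prepend_path("LD_LIBRARY_PATH", pathJoin(root, "venv/lib"))
setenv("DEEPMD_ROOT", root)
-- pull in the matching CUDA toolkit whenever this module is loaded
depends_on("cuda/12.9")
```

As I understand it, Lmod automatically reverses `prepend_path` and `setenv` on `module unload`, so this should unload cleanly.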
Much appreciated!