Error in DataParallel when training with multiple GPUs with torch 1.11 #37

I am getting this error when trying to train with multiple GPUs:
  File "./lib/python3.7/site-packages/torch/nn/parallel/replicate.py", line 71, in _broadcast_coalesced_reshape
NotImplementedError: Could not run 'aten::view' with arguments from the 'SparseCUDA' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'aten::view' is only available for these backends: [CPU, CUDA, Meta, QuantizedCPU, QuantizedCUDA, MkldnnCPU, BackendSelect, Python, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradLazy, AutogradXPU, AutogradMLC, AutogradHPU, AutogradNestedTensor, AutogradPrivateUse1, AutogradPrivateUse2, AutogradPrivateUse3, Tracer, AutocastCPU, Autocast, Batched, VmapMode, Functionalize].
I am trying to run the default tenserflow.yml experiment. Training works fine on a single GPU, and evaluation works on multiple GPUs.

I looked at the blocks in the module and could not find anything involving sparsity. I am using PyTorch 1.11 with CUDA 11.3. Any idea why this happens? Thanks.
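
In case it helps narrow this down, here is a minimal sketch of how one might check for sparse tensors programmatically instead of reading through the blocks by eye. The traceback points at `_broadcast_coalesced_reshape`, and DataParallel broadcasts both parameters and buffers when it replicates the model, so the sketch scans both. The helper name `find_sparse_tensors` is hypothetical, and this is only one possible thing to check, not a confirmed cause of the error.

```python
def find_sparse_tensors(module):
    """Return (kind, name) pairs for every sparse parameter or buffer.

    `module` is expected to behave like a torch.nn.Module, i.e. to provide
    named_parameters() and named_buffers(). The check relies on the
    `is_sparse` attribute of tensors, which is True for sparse COO tensors.
    """
    hits = []
    # Parameters are broadcast to every GPU during replication.
    for name, param in module.named_parameters():
        if getattr(param, "is_sparse", False):
            hits.append(("parameter", name))
    # Buffers are broadcast too, so a sparse buffer would also be affected.
    for name, buf in module.named_buffers():
        if getattr(buf, "is_sparse", False):
            hits.append(("buffer", name))
    return hits
```

Calling `find_sparse_tensors(model)` on the model right before it is wrapped in `torch.nn.DataParallel` (where `model` stands for whatever module the training script builds) should return an empty list if nothing sparse is registered. Any entry it returns names a tensor worth inspecting.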

