IndexError: tuple index out of range while running scripts/train_cityscapes.yml (cached_x.grad_fn.next_functions[1][0].variable) #169

Open
doulemint opened this issue Oct 23, 2021 · 4 comments

Comments

@doulemint

None
None
Global Rank: 0 Local Rank: 0
Global Rank: 1 Local Rank: 1
Torch version: 1.1, 1.10.0+cu102
n scales [0.5, 1.0, 2.0]
dataset = cityscapes
ignore_label = 255
num_classes = 19
cv split val 0 ['val/lindau', 'val/frankfurt', 'val/munster']
mode val found 500 images
cn num_classes 19
cv split train 0 ['train/aachen', 'train/bochum', 'train/bremen', 'train/cologne', 'train/darmstadt', 'train/dusseldorf', 'train/erfurt', 'train/hamburg', 'train/hanover', 'train/jena', 'train/krefeld', 'train/monchengladbach', 'train/strasbourg', 'train/stuttgart', 'train/tubingen', 'train/ulm', 'train/weimar', 'train/zurich']
mode train found 2975 images
cn num_classes 19
Loading centroid file /home/Xiya/semantic-segmentation/assets/uniform_centroids/cityscapes_cv0_tile1024.json
Found 19 centroids
Class Uniform Percentage: 0.5
Class Uniform items per Epoch: 2975
cls 0 len 5866
cls 1 len 5184
cls 2 len 5678
cls 3 len 1312
cls 4 len 1723
cls 5 len 5656
cls 6 len 2769
cls 7 len 4860
cls 8 len 5388
cls 9 len 2440
cls 10 len 4722
cls 11 len 3719
cls 12 len 1239
cls 13 len 5075
cls 14 len 444
cls 15 len 348
cls 16 len 188
cls 17 len 575
cls 18 len 2238
Using Cross Entropy Loss
Loading weights from: checkpoint=/home/Xiya/semantic-segmentation/assets//seg_weights/ocrnet.HRNet_industrious-chicken.pth
=> init weights from normal distribution
=> loading pretrained model /home/Xiya/semantic-segmentation/assets/seg_weights/hrnetv2_w48_imagenet_pretrained.pth
Trunk: hrnetv2
Model params = 72.1M
Selected optimization level O1: Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Skipped loading parameter module.ocr.cls_head.weight
Skipped loading parameter module.ocr.cls_head.bias
Skipped loading parameter module.ocr.aux_head.2.weight
Skipped loading parameter module.ocr.aux_head.2.bias
Skipped loading parameter module.scale_attn.conv0.weight
Skipped loading parameter module.scale_attn.bn0.weight
Skipped loading parameter module.scale_attn.bn0.bias
Skipped loading parameter module.scale_attn.bn0.running_mean
Skipped loading parameter module.scale_attn.bn0.running_var
Skipped loading parameter module.scale_attn.bn0.num_batches_tracked
Skipped loading parameter module.scale_attn.conv1.weight
Skipped loading parameter module.scale_attn.bn1.weight
Skipped loading parameter module.scale_attn.bn1.bias
Skipped loading parameter module.scale_attn.bn1.running_mean
Skipped loading parameter module.scale_attn.bn1.running_var
Skipped loading parameter module.scale_attn.bn1.num_batches_tracked
Skipped loading parameter module.scale_attn.conv2.weight
Class Uniform Percentage: 0.5
Class Uniform items per Epoch: 2975
cls 0 len 5866
cls 1 len 5184
cls 2 len 5678
cls 3 len 1312
cls 4 len 1723
cls 5 len 5656
cls 6 len 2769
cls 7 len 4860
cls 8 len 5388
cls 9 len 2440
cls 10 len 4722
cls 11 len 3719
cls 12 len 1239
cls 13 len 5075
cls 14 len 444
cls 15 len 348
cls 16 len 188
cls 17 len 575
cls 18 len 2238
/home/Xiya/anaconda/envs/py_seg/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects --local_rank argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

warnings.warn(
WARNING:torch.distributed.run:


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


/home/Xiya/anaconda/envs/py_seg/lib/python3.8/site-packages/torch/nn/functional.py:3679: UserWarning: The default behavior for interpolate/upsample with float scale_factor changed in 1.6.0 to align with other frameworks/libraries, and now uses scale_factor directly, instead of relying on the computed output size. If you wish to restore the old behavior, please set recompute_scale_factor=True. See the documentation of nn.Upsample for details.
warnings.warn(
/home/Xiya/anaconda/envs/py_seg/lib/python3.8/site-packages/torch/nn/functional.py:3679: UserWarning: The default behavior for interpolate/upsample with float scale_factor changed in 1.6.0 to align with other frameworks/libraries, and now uses scale_factor directly, instead of relying on the computed output size. If you wish to restore the old behavior, please set recompute_scale_factor=True. See the documentation of nn.Upsample for details.
warnings.warn(
Traceback (most recent call last):
File "train.py", line 601, in <module>
main()
File "train.py", line 451, in main
train(train_loader, net, optim, epoch)
File "train.py", line 491, in train
main_loss = net(inputs)
File "/home/Xiya/anaconda/envs/py_seg/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/Xiya/anaconda/envs/py_seg/lib/python3.8/site-packages/apex/parallel/distributed.py", line 560, in forward
result = self.module(*inputs, **kwargs)
File "/home/Xiya/anaconda/envs/py_seg/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/Xiya/semantic-segmentation/network/ocrnet.py", line 334, in forward
return self.two_scale_forward(inputs)
File "/home/Xiya/semantic-segmentation/network/ocrnet.py", line 284, in two_scale_forward
hi_outs = self._fwd(x_1x)
File "/home/Xiya/semantic-segmentation/network/ocrnet.py", line 173, in _fwd
_, _, high_level_features = self.backbone(x)
File "/home/Xiya/anaconda/envs/py_seg/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/Xiya/semantic-segmentation/network/hrnetv2.py", line 400, in forward
x = self.conv1(x_in)
File "/home/Xiya/anaconda/envs/py_seg/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/Xiya/anaconda/envs/py_seg/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 446, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/home/Xiya/anaconda/envs/py_seg/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 442, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
File "/home/Xiya/anaconda/envs/py_seg/lib/python3.8/site-packages/apex/amp/wrap.py", line 21, in wrapper
args[i] = utils.cached_cast(cast_fn, args[i], handle.cache)
File "/home/Xiya/anaconda/envs/py_seg/lib/python3.8/site-packages/apex/amp/utils.py", line 97, in cached_cast
if cached_x.grad_fn.next_functions[1][0].variable is not x:
IndexError: tuple index out of range

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 7285) of binary: /home/Xiya/anaconda/envs/py_seg/bin/python
Traceback (most recent call last):
File "/home/Xiya/anaconda/envs/py_seg/lib/python3.8/runpy.py", line 192, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/Xiya/anaconda/envs/py_seg/lib/python3.8/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/Xiya/anaconda/envs/py_seg/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
main()
File "/home/Xiya/anaconda/envs/py_seg/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/home/Xiya/anaconda/envs/py_seg/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/home/Xiya/anaconda/envs/py_seg/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
elastic_launch(
File "/home/Xiya/anaconda/envs/py_seg/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/Xiya/anaconda/envs/py_seg/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train.py FAILED

Failures:
[1]:
time : 2021-10-23_07:58:29
host : ivslab2
rank : 1 (local_rank: 1)
exitcode : -11 (pid: 7286)
error_file: <N/A>
traceback : Signal 11 (SIGSEGV) received by PID 7286

Root Cause (first observed failure):
[0]:
time : 2021-10-23_07:58:29
host : ivslab2
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 7285)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
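
The FutureWarning in the log above asks scripts to read the local rank from os.environ['LOCAL_RANK'] rather than relying on a --local_rank argument. A minimal sketch of that change, assuming the training script currently takes --local_rank on the command line; resolve_local_rank is a hypothetical helper name, not something from this repository:

```python
import argparse
import os


def resolve_local_rank(argv=None, environ=None):
    """Return the local rank, preferring LOCAL_RANK from the environment.

    Falls back to a --local_rank flag (the older torch.distributed.launch
    convention), then to 0 for single-process runs.
    """
    environ = os.environ if environ is None else environ

    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--local_rank", type=int, default=None)
    # parse_known_args leaves any other flags of the training script alone.
    args, _unknown = parser.parse_known_args(argv)

    if "LOCAL_RANK" in environ:
        return int(environ["LOCAL_RANK"])
    if args.local_rank is not None:
        return args.local_rank
    return 0
```

Calling resolve_local_rank() with no arguments reads sys.argv and os.environ, so the same script should work whether it is started through the older launcher or through torchrun.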

CUDA version: 10.2
apex: installed with cuda_ext
The dataset loads successfully.
How can I fix this?
Has anyone else run into this problem?
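
The failing line in apex/amp/utils.py indexes next_functions[1], so it assumes the cast tensor's grad_fn has at least two input edges. One possible explanation, which is an assumption and has not been verified on this setup, is that newer PyTorch releases record a dtype cast with a single input edge, leaving index 1 out of range. The sketch below shows a check that does not depend on the edge position. is_autograd_parent is a hypothetical helper, not part of apex, and stub objects stand in for tensors so the snippet runs without PyTorch:

```python
from types import SimpleNamespace


def is_autograd_parent(cached_x, x):
    """Return True if any input edge of cached_x.grad_fn leads to tensor x.

    Scans every entry of grad_fn.next_functions instead of assuming the
    source tensor sits at a fixed index such as [1].
    """
    grad_fn = getattr(cached_x, "grad_fn", None)
    if grad_fn is None:
        return False
    for fn, _input_index in grad_fn.next_functions:
        # Edges to leaf tensors expose the tensor as `.variable`; other
        # edges are None or lack that attribute.
        if fn is not None and getattr(fn, "variable", None) is x:
            return True
    return False


# Stand-ins for a weight tensor and its autograd edge (assumed layouts).
weight = SimpleNamespace(name="conv1.weight")
other = SimpleNamespace(name="unrelated")
leaf_edge = SimpleNamespace(variable=weight)

two_edge_cast = SimpleNamespace(
    grad_fn=SimpleNamespace(next_functions=((None, 0), (leaf_edge, 0)))
)
one_edge_cast = SimpleNamespace(
    grad_fn=SimpleNamespace(next_functions=((leaf_edge, 0),))
)

print(is_autograd_parent(two_edge_cast, weight))  # True
print(is_autograd_parent(one_edge_cast, weight))  # True
print(is_autograd_parent(one_edge_cast, other))   # False
```

With real tensors the same function could be called as is_autograd_parent(param.half(), param), and printing len(param.half().grad_fn.next_functions) would show which layout the installed PyTorch uses.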

@AxMM

AxMM commented Oct 24, 2021

Hello,
I have never run into this problem,
but it looks like you are using Python 3.8.
You could try pytorch=1.3.0 with python=3.6.
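
Before and after switching environments, it may help to confirm which versions are actually active. A small sketch: parse_version is a hypothetical helper, and the version strings are the ones that appear in this thread. Comparing tuples avoids the pitfall that "1.10.0" sorts before "1.3.0" as a string.

```python
import re
import sys


def parse_version(version_string):
    """Turn a version such as '1.10.0+cu102' into a tuple like (1, 10, 0)."""
    release = version_string.split("+", 1)[0]  # drop the local build tag
    numbers = []
    for piece in release.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component
        numbers.append(int(match.group()))
    return tuple(numbers)


print(parse_version("1.10.0+cu102"))                      # (1, 10, 0)
print(parse_version("1.10.0") > parse_version("1.3.0"))   # True
print("1.10.0" > "1.3.0")                                 # False (string order)
print("running Python", tuple(sys.version_info[:2]))
```

In a real environment the string to parse would come from torch.__version__.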

@doulemint
Author

@AxMM Thank you very much for replying to my question! I switched to Python 3.6, but it still doesn't work. I then disabled apex to find the underlying problem, and it turned out to be a tensor type mismatch. Have you ever run into this?
If I disable fp16 (setting the option to false), the code runs but keeps returning a negative loss.
If I enable fp16, it keeps reporting that the inputs are half tensors while the network's weights are float tensors.

When I tried converting the inputs to float, it turned out that the code expects half inputs somewhere.
Thank you in advance for your help.
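
One way to narrow down a half/float mismatch like this is to list which tensors differ from the dtype the inputs have. The sketch below is only a diagnostic aid under stated assumptions: find_dtype_mismatches is a hypothetical helper, and stub objects with a dtype attribute stand in for real parameters so that it runs without PyTorch.

```python
from types import SimpleNamespace


def find_dtype_mismatches(named_tensors, expected_dtype):
    """Return (name, dtype) pairs whose dtype differs from expected_dtype."""
    return [
        (name, tensor.dtype)
        for name, tensor in named_tensors
        if tensor.dtype != expected_dtype
    ]


# Stub parameters; the names are borrowed from the log for illustration only.
params = [
    ("backbone.conv1.weight", SimpleNamespace(dtype="float32")),
    ("ocr.cls_head.weight", SimpleNamespace(dtype="float16")),
]

print(find_dtype_mismatches(params, "float16"))
# [('backbone.conv1.weight', 'float32')]
```

With a real model the call would look like find_dtype_mismatches(net.named_parameters(), inputs.dtype). Note that the log shows opt_level O1, which is described there as inserting automatic casts around PyTorch functions, so float32 weights are not necessarily a problem on their own under that setting.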

@linzhiqiu

Were you able to solve this issue? I am running into the same problem.

@hamzagorgulu

I'm having the same issue. Has anyone been able to solve it?
